Re: GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

2015-05-20 Thread Burak Yavuz
Could you please open a JIRA for it? The maxBins input is missing from the
Python API.

Is it possible for you to use the current master? In the current master,
you should be able to use trees with the Pipeline API and DataFrames.
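
In the meantime, here is a minimal sketch of the Scala API, which does expose
maxBins through BoostingStrategy (assuming Spark 1.3, an RDD[LabeledPoint]
called trainingData, and the same categorical feature map, catFeatures, as a
Map[Int, Int]):

```
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 3
boostingStrategy.treeStrategy.maxDepth = 6
boostingStrategy.treeStrategy.maxBins = 1900
boostingStrategy.treeStrategy.categoricalFeaturesInfo = catFeatures

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
```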

Best,
Burak

On Wed, May 20, 2015 at 2:44 PM, Don Drake  wrote:

> I'm running Spark v1.3.1 and when I run the following against my dataset:
>
> model = GradientBoostedTrees.trainRegressor(trainingData,
> categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
>
> The job will fail with the following message:
> Traceback (most recent call last):
>   File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
> model = GradientBoostedTrees.trainRegressor(trainingData,
> categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
>   File
> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py",
> line 553, in trainRegressor
> loss, numIterations, learningRate, maxDepth)
>   File
> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py",
> line 438, in _train
> loss, numIterations, learningRate, maxDepth)
>   File
> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py",
> line 120, in callMLlibFunc
> return callJavaFunc(sc, api, *args)
>   File
> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py",
> line 113, in callJavaFunc
> return _java2py(sc, func(*args))
>   File
> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>   File
> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
> line 300, in get_return_value
> 15/05/20 16:40:12 INFO BlockManager: Removing block rdd_32_95
> py4j.protocol.Py4JJavaError: An error occurred while calling
> o69.trainGradientBoostedTreesModel.
> : java.lang.IllegalArgumentException: requirement failed: DecisionTree
> requires maxBins (= 32) >= max categories in categorical features (= 1895)
> at scala.Predef$.require(Predef.scala:233)
> at
> org.apache.spark.mllib.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:128)
> at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:138)
> at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:60)
> at
> org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
> at
> org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
> at
> org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)
> at
> org.apache.spark.mllib.api.python.PythonMLLibAPI.trainGradientBoostedTreesModel(PythonMLLibAPI.scala:595)
>
> So it's complaining about maxBins. If I provide maxBins=1900 and
> re-run it:
>
> model = GradientBoostedTrees.trainRegressor(trainingData,
> categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900)
>
> Traceback (most recent call last):
>   File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
> model = GradientBoostedTrees.trainRegressor(trainingData,
> categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900)
> TypeError: trainRegressor() got an unexpected keyword argument 'maxBins'
>
> It now says it knows nothing of maxBins.
>
> If I run the same command against DecisionTree or RandomForest (with
> maxBins=1900) it works just fine.
>
> Seems like a bug in GradientBoostedTrees.
>
> Suggestions?
>
> -Don
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> 800-733-2143
>


Re: ReduceByKey and sorting within partitions

2015-05-04 Thread Burak Yavuz
I think this Spark Package may be what you're looking for!
http://spark-packages.org/package/tresata/spark-sorted
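
If an extra shuffle is acceptable, a rough sketch with only built-in APIs (not
the package above) is to reduce first and then range-partition and sort within
partitions; note this does not push the ordering into the reduce shuffle itself:

```
import org.apache.spark.RangePartitioner

// sc is the SparkContext; the pairs are tiny and purely illustrative
val pairs = sc.parallelize(Seq((3, 1.0), (1, 2.0), (3, 4.0), (2, 0.5)))
val reduced = pairs.reduceByKey(_ + _)
// range-partition by key and sort within each partition in a single extra shuffle
val sorted = reduced.repartitionAndSortWithinPartitions(new RangePartitioner(2, reduced))
```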

Best,
Burak

On Mon, May 4, 2015 at 12:56 PM, Imran Rashid  wrote:

> oh wow, that is a really interesting observation, Marco & Jerry.
> I wonder if this is worth exposing in combineByKey()?  I think Jerry's
> proposed workaround is all you can do for now -- use reflection to
> side-step the fact that the methods you need are private.
>
> On Mon, Apr 27, 2015 at 8:07 AM, Saisai Shao 
> wrote:
>
>> Hi Marco,
>>
>> As far as I know, the current combineByKey() does not expose an argument
>> for setting keyOrdering on the ShuffledRDD, since ShuffledRDD is package
>> private. If you can get at the ShuffledRDD through reflection or some other
>> way, the keyOrdering you set will be pushed down to the shuffle. If you use a
>> combination of transformations instead, the result will be the same but the
>> efficiency may differ: some transformations will be separated into
>> different stages, which introduces an additional shuffle.
>>
>> Thanks
>> Jerry
>>
>>
>> 2015-04-27 19:00 GMT+08:00 Marco :
>>
>>> Hi,
>>>
>>> I'm trying, after reducing by key, to get data ordered among partitions
>>> (like RangePartitioner) and within partitions (like sortByKey or
>>> repartitionAndSortWithinPartitions), pushing the sorting down to the
>>> shuffle machinery of the reducing phase.
>>>
>>> I think, but maybe I'm wrong, that the correct way to do that would be for
>>> combineByKey to call the setKeyOrdering function on the ShuffledRDD that it
>>> returns.
>>>
>>> Am I wrong? Can it be done by a combination of other transformations with
>>> the same efficiency?
>>>
>>> Thanks,
>>> Marco
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: DataFrame filter referencing error

2015-04-30 Thread Burak Yavuz
Is "new" a reserved word for MySQL?
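
If the pushed-down predicate turns out to be the problem, one way to sidestep it
is to hand the whole query to MySQL yourself through the dbtable option. A
hedged sketch, assuming Spark 1.3's jdbc source (URL and table name are
placeholders):

```
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // sc is the SparkContext
val newRows = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:mysql://host:3306/db?user=u&password=p",
  "dbtable" -> "(SELECT * FROM my_table WHERE status = 'new') t"))
newRows.show()
```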

On Thu, Apr 30, 2015 at 2:41 PM, Francesco Bigarella <
francesco.bigare...@gmail.com> wrote:

> Do you know how I can check that? I googled a bit but couldn't find a
> clear explanation about it. I also tried to use explain() but it doesn't
> really help.
> I still find it unusual that I have this issue only for the equality operator
> but not for the others.
>
> Thank you,
> F
>
> On Wed, Apr 29, 2015 at 3:03 PM ayan guha  wrote:
>
>> Looks like your DF is based on a MySQL DB accessed via JDBC, and the error is
>> thrown from MySQL. Can you see what SQL finally gets fired in MySQL? Spark
>> is pushing the predicate down to MySQL, so it's not a Spark problem per se.
>>
>> On Wed, Apr 29, 2015 at 9:56 PM, Francesco Bigarella <
>> francesco.bigare...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I was testing the DataFrame filter functionality and I found what I
>>> think is a strange behaviour.
>>> My dataframe testDF, obtained by loading a MySQL table via JDBC, has the
>>> following schema:
>>> root
>>>  | -- id: long (nullable = false)
>>>  | -- title: string (nullable = true)
>>>  | -- value: string (nullable = false)
>>>  | -- status: string (nullable = false)
>>>
>>> What I want to do is filter my dataset to obtain all rows that have a
>>> status = "new".
>>>
>>> scala> testDF.filter(testDF("id") === 1234).first()
>>> works fine (also with the integer value within double quotes), however
>>> if I try to use the same statement to filter on the status column (also
>>> with changes in the syntax - see below), suddenly the program breaks.
>>>
>>> Any of the following
>>> scala> testDF.filter(testDF("status") === "new")
>>> scala> testDF.filter("status = 'new'")
>>> scala> testDF.filter($"status" === "new")
>>>
>>> generates the error:
>>>
>>> INFO scheduler.DAGScheduler: Job 3 failed: runJob at
>>> SparkPlan.scala:121, took 0.277907 s
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>>> 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>>> 3.0 (TID 12, ):
>>> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column
>>> 'new' in 'where clause'
>>>
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>> at
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>> at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
>>> at com.mysql.jdbc.Util.getInstance(Util.java:386)
>>> at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1052)
>>> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3597)
>>> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3529)
>>> at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1990)
>>> at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2151)
>>> at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2625)
>>> at
>>> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2119)
>>> at
>>> com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2283)
>>> at org.apache.spark.sql.jdbc.JDBCRDD$anon$1.<init>(JDBCRDD.scala:328)
>>> at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:309)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> Does filter work only on columns of the integer type? What is the exact
>>> behaviour of the filter function and what is the best way to handle the
>>> query I am trying to execute?
>>>
>>> Thank you,
>>> Francesco
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>


Fwd: Change ivy cache for spark on Windows

2015-04-27 Thread Burak Yavuz
+user

-- Forwarded message --
From: Burak Yavuz 
Date: Mon, Apr 27, 2015 at 1:59 PM
Subject: Re: Change ivy cache for spark on Windows
To: mj 


Hi,

In your conf file (SPARK_HOME\conf\spark-defaults.conf) you can set:

`spark.jars.ivy \your\path`
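
For example (the path is just a placeholder; pick one without spaces):

```
# SPARK_HOME\conf\spark-defaults.conf
spark.jars.ivy  C:\spark-ivy
```

or equivalently on the command line:

```
.\spark-shell.cmd --conf spark.jars.ivy=C:\spark-ivy --packages com.databricks:spark-csv_2.10:1.0.3
```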


Best,
Burak

On Mon, Apr 27, 2015 at 1:49 PM, mj  wrote:

> Hi,
>
> I'm having trouble using the --packages option for spark-shell.cmd - I have
> to use Windows at work and have been issued a username with a space in it,
> which means that when I use the --packages option it fails with this message:
>
> "Exception in thread "main" java.net.URISyntaxException: Illegal character
> in path at index 13: C:/Users/My Name/.ivy2/jars/spark-csv_2.10.jar"
>
> The command I'm trying to run is:
> .\spark-shell.cmd --packages com.databricks:spark-csv_2.10:1.0.3
>
> I've tried creating an ivysettings.xml file with the content below in my
> .ivy2 directory, but spark doesn't seem to pick it up. Does anyone have any
> ideas of how to get around this issue?
>
> 
> 
> 
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Change-ivy-cache-for-spark-on-Windows-tp22675.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Understanding Spark/MLlib failures

2015-04-23 Thread Burak Yavuz
Hi Andrew,

I observed similar behavior under high GC pressure when running ALS. What
happened to me was that there would be very long Full GC pauses (over 600
seconds at times). These would prevent the executors from sending
heartbeats to the driver. The driver would then think that the executor had
died, so it would kill it. The scheduler would look at the outputs and say:
`org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 1` or `Fetch Failed`, then reschedule the job on a
different executor.

The remaining executors would then get even more overloaded, causing them to
GC more often. Because executors were being killed by the driver, new jobs
with the same name (and fewer tasks) would be launched. It usually led to a
spiral of death, where executors were constantly being killed and the stage
was never completed, only restarted with different numbers of tasks.

Some configuration parameters that helped me through this process were:

spark.executor.memory  // decrease the executor memory so that Full GCs
take less time, though they are more frequent
spark.executor.heartbeatInterval // This I set at 60 for 600 seconds
(10 minute GC!!)
spark.core.connection.ack.wait.timeout // another timeout to set
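
A minimal sketch of setting these programmatically (the values are illustrative,
not tuned recommendations; if I remember correctly the heartbeat interval is in
milliseconds and the ack timeout in seconds in Spark 1.x):

```
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ALS")
  .set("spark.executor.memory", "8g")                    // smaller heaps => shorter (but more frequent) Full GCs
  .set("spark.executor.heartbeatInterval", "600000")     // 600 seconds
  .set("spark.core.connection.ack.wait.timeout", "600")  // 600 seconds
val sc = new SparkContext(conf)
```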

Hope these parameters help you. I haven't directly answered your questions,
but there are bits and pieces in there that are hopefully helpful.

Best,
Burak


On Thu, Apr 23, 2015 at 4:11 PM, aleverentz  wrote:

> [My apologies if this is a re-post.  I wasn't subscribed the first time I
> sent this message, and I'm hoping this second message will get through.]
>
> I’ve been using Spark 1.3.0 and MLlib for some machine learning tasks.  In
> a
> fit of blind optimism, I decided to try running MLlib’s Principal
> Components
> Analayis (PCA) on a dataset with approximately 10,000 columns and 200,000
> rows.
>
> The Spark job has been running for about 5 hours on a small cluster, and it
> has been stuck on a particular job ("treeAggregate at RowMatrix.scala:119")
> for most of that time.  The treeAggregate job is now on "retry 5", and after
> each failure it seems that the next retry uses a smaller number of tasks.
> (Initially, there were around 80 tasks; later it was down to 50, then 42;
> now it’s down to 16.)  The web UI shows the following error under "failed
> stages":  "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an
> output location for shuffle 1".
>
> This raises a few questions:
>
> 1. What does "missing an output location for shuffle 1" mean?  I’m guessing
> this cryptic error message is indicative of some more fundamental problem
> (out of memory? out of disk space?), but I’m not sure how to diagnose it.
>
> 2. Why do subsequent retries use fewer and fewer tasks?  Does this mean that
> the algorithm is actually making progress?  Or is the scheduler just
> performing some kind of repartitioning and starting over from scratch?
> (Also, If the algorithm is in fact making progress, should I expect it to
> finish eventually?  Or do repeated failures generally indicate that the
> cluster is too small to perform the given task?)
>
> 3. Is it reasonable to expect that I could get PCA to run on this dataset
> using the same cluster simply by changing some configuration parameters?  Or
> is a larger cluster with significantly more resources per node the only way
> around this problem?
>
> 4. In general, are there any tips for diagnosing performance issues like the
> one above?  I've spent some time trying to get a few different algorithms to
> scale to larger and larger datasets, and whenever I run into a failure, I'd
> like to be able to identify the bottleneck that is preventing further
> scaling.  Any general advice for doing that kind of detective work would be
> much appreciated.
>
> Thanks,
>
> ~ Andrew
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-Spark-MLlib-failures-tp22641.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Benchmarking col vs row similarities

2015-04-10 Thread Burak Yavuz
It depends... The heartbeat warning you received happens due to GC pressure
(probably a Full GC). If you increase the memory too much, GCs may be less
frequent, but each Full GC may take longer. Try increasing the following
confs:

spark.executor.heartbeatInterval
spark.core.connection.ack.wait.timeout
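
On YARN you can pass these straight to spark-submit, e.g. (values illustrative):

```
spark-submit --master yarn-client \
  --conf spark.executor.heartbeatInterval=600000 \
  --conf spark.core.connection.ack.wait.timeout=600 \
  your-app.jar
```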

Best,
Burak

On Fri, Apr 10, 2015 at 8:52 PM, Debasish Das 
wrote:

> I will increase memory for the job...that will also fix it right ?
> On Apr 10, 2015 12:43 PM, "Reza Zadeh"  wrote:
>
>> You should pull in this PR: https://github.com/apache/spark/pull/5364
>> It should resolve that. It is in master.
>> Best,
>> Reza
>>
>> On Fri, Apr 10, 2015 at 8:32 AM, Debasish Das 
>> wrote:
>>
>>> Hi,
>>>
>>> I am benchmarking row vs col similarity flow on 60M x 10M matrices...
>>>
>>> Details are in this JIRA:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-4823
>>>
>>> For testing I am using Netflix data since the structure is very similar:
>>> 50k x 17K near dense similarities..
>>>
>>> Items are 17K and so I did not activate threshold in colSimilarities yet
>>> (it's at 1e-4)
>>>
>>> Running Spark on YARN with 20 nodes, 4 cores, 16 gb, shuffle threshold
>>> 0.6
>>>
>>> I keep getting these from col similarity code from 1.2 branch. Should I
>>> use Master ?
>>>
>>> 15/04/10 11:08:36 WARN BlockManagerMasterActor: Removing BlockManager
>>> BlockManagerId(5, tblpmidn36adv-hdp.tdc.vzwcorp.com, 44410) with no
>>> recent heart beats: 50315ms exceeds 45000ms
>>>
>>> 15/04/10 11:09:12 ERROR ContextCleaner: Error cleaning broadcast 1012
>>>
>>> java.util.concurrent.TimeoutException: Futures timed out after [30
>>> seconds]
>>>
>>> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>>>
>>> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>>>
>>> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>>>
>>> at
>>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>>>
>>> at scala.concurrent.Await$.result(package.scala:107)
>>>
>>> at
>>> org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:137)
>>>
>>> at
>>> org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:227)
>>>
>>> at
>>> org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
>>>
>>> at
>>> org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66)
>>>
>>> at
>>> org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:185)
>>>
>>> at
>>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:147)
>>>
>>> at
>>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:138)
>>>
>>> at scala.Option.foreach(Option.scala:236)
>>>
>>> at
>>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:138)
>>>
>>> at
>>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)
>>>
>>> at
>>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)
>>>
>>> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1468)
>>>
>>> at org.apache.spark.ContextCleaner.org
>>> $apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:133)
>>>
>>> at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65)
>>>
>>> I know how to increase the 45000 ms timeout to something higher, as it is a
>>> compute-heavy job, but in YARN I am not sure how to set that config.
>>>
>>> But in any case that's a warning and should not affect the job...
>>>
>>> Any idea how to improve the runtime other than increasing threshold to
>>> 1e-2 ? I will do that next
>>>
>>> Was netflix dataset benchmarked for col based similarity flow before ?
>>> similarity output from this dataset becomes near dense and so it is
>>> interesting for stress testing...
>>>
>>> Thanks.
>>>
>>> Deb
>>>
>>
>>


Re: Query REST web service with Spark?

2015-03-31 Thread Burak Yavuz
Hi,

If I recall correctly, I've read about people integrating REST calls into Spark
Streaming jobs on the user list. I can't think of any reason why it
shouldn't be possible.
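
As a rough illustration (the URL and parsing are placeholders, and ids is
assumed to be an RDD of keys to look up), something like this works inside a
batch or streaming transformation:

```
import scala.io.Source

val enriched = ids.mapPartitions { iter =>
  // mapPartitions lets you set up an expensive HTTP client once per partition;
  // plain scala.io.Source is used here only to keep the sketch short
  iter.map { id =>
    val body = Source.fromURL(s"https://example.com/lookup?id=$id").mkString
    (id, body) // parse the response and merge it with your records as needed
  }
}
```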

Best,
Burak

On Tue, Mar 31, 2015 at 1:46 PM, Minnow Noir  wrote:

> We have some data on Hadoop that needs to be augmented with data only
> available to us via a REST service.  We're using Spark to search for, and
> correct, missing data. Even though there are a lot of records to scour for
> missing data, the total number of calls to the service is expected to be
> low, so it would be ideal to do the whole job in Spark as we scour the data.
>
> I don't see anything obvious in the API or on Google relating to making
> REST calls from a Spark job.  Is it possible?
>
> Thanks,
>
> Alec
>


Re: Why KMeans with mllib is so slow ?

2015-03-28 Thread Burak Yavuz
Hi David,

Can you also try with Spark 1.3 if possible? I believe there was a 2x
improvement on K-Means between 1.2 and 1.3.

Thanks,
Burak



On Sat, Mar 28, 2015 at 9:04 PM, davidshen84  wrote:

> Hi Jao,
>
> Sorry to pop up this old thread. I am having the same problem that you did. I
> want to know if you have figured out how to improve k-means on Spark.
>
> I am using Spark 1.2.0. My data set is about 270k vectors, each has about
> 350 dimensions. If I set k=500, the job takes about 3hrs on my cluster. The
> cluster has 7 executors, each has 8 cores...
>
> If I set k=5000 which is the required value for my task, the job goes on
> forever...
>
>
> Thanks,
> David
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-KMeans-with-mllib-is-so-slow-tp20480p22273.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: netlib-java cannot load native lib in Windows when using spark-submit

2015-03-22 Thread Burak Yavuz
Did you build Spark with: -Pnetlib-lgpl?

Ref: https://spark.apache.org/docs/latest/mllib-guide.html
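
For example, the Maven build would look something like:

```
mvn -Pnetlib-lgpl -DskipTests clean package
```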

Burak

On Sun, Mar 22, 2015 at 7:37 AM, Ted Yu  wrote:

> How about pointing LD_LIBRARY_PATH to native lib folder ?
>
> You need Spark 1.2.0 or higher for the above to work. See SPARK-1719
>
> Cheers
>
> On Sun, Mar 22, 2015 at 4:02 AM, Xi Shen  wrote:
>
>> Hi Ted,
>>
>> I have tried to invoke the command from both cygwin environment and
>> powershell environment. I still get the messages:
>>
>> 15/03/22 21:56:00 WARN netlib.BLAS: Failed to load implementation from:
>> com.github.fommil.netlib.NativeSystemBLAS
>> 15/03/22 21:56:00 WARN netlib.BLAS: Failed to load implementation from:
>> com.github.fommil.netlib.NativeRefBLAS
>>
>> From the Spark UI, I can see:
>>
>>   spark.driver.extraLibrary c:\openblas
>>
>>
>> Thanks,
>> David
>>
>>
>> On Sun, Mar 22, 2015 at 11:45 AM Ted Yu  wrote:
>>
>>> Can you try the --driver-library-path option ?
>>>
>>> spark-submit --driver-library-path /opt/hadoop/lib/native ...
>>>
>>> Cheers
>>>
>>> On Sat, Mar 21, 2015 at 4:58 PM, Xi Shen  wrote:
>>>
 Hi,

 I use the *OpenBLAS* DLL, and have configured my application to work
 in the IDE. When I start my Spark application from the IntelliJ IDE, I can see in
 the log that the native lib is loaded successfully.

 But if I use *spark-submit* to start my application, the native lib
 still cannot be loaded. I saw the WARN message that it failed to load both
 the native and native-ref library. I checked the *Environment* tab in
 the Spark UI, and the *java.library.path* is set correctly.


 Thanks,

 David



>>>
>


Re: RDD ordering after map

2015-03-18 Thread Burak Yavuz
Hi,
Yes, ordering is preserved with map. Shuffles break ordering.
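
A tiny illustration (sc is the SparkContext):

```
val xs = sc.parallelize(1 to 10, 2).sortBy(identity) // a known ordering
val mapped = xs.map(_ * 2)                           // map keeps the parent's order
println(mapped.collect().mkString(", "))             // 2, 4, 6, ..., 20
```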

Burak

On Wed, Mar 18, 2015 at 2:02 PM, sergunok  wrote:

> Does map(...) preserve ordering of original RDD?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/RDD-ordering-after-map-tp22129.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Getting incorrect weights for LinearRegression

2015-03-13 Thread Burak Yavuz
Hi,

I would suggest you use LBFGS, as I think the step size is hurting you. You
can run the same thing in LBFGS as:

```
import org.apache.spark.mllib.optimization.{LBFGS, LeastSquaresGradient, SimpleUpdater}

val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater())
val initialWeights = Vectors.dense(Array.fill(3)(scala.util.Random.nextDouble()))

// LBFGS.optimize expects (label, features) pairs rather than LabeledPoints
val weights = algorithm.optimize(parsedData.map(lp => (lp.label, lp.features)), initialWeights)
```

Note that weights will be a Vector and not a model. You can then generate
the model with:

val w = weights.toArray
// assuming the bias term was appended as the last element of the feature vector
val intercept = w.last
val model = new LinearRegressionModel(Vectors.dense(w.dropRight(1)), intercept)


Best,
Burak

On Wed, Mar 11, 2015 at 11:59 AM, EcoMotto Inc.  wrote:

> Hello,
>
> I am trying to run LinearRegression on a dummy data set, given below. Here
> I tried all different settings but I am still failing to reproduce desired
> coefficients.
>
> Please help me out, as I facing the same problem in my actual dataset.
> Thank you.
>
> This dataset is generated based on the simple equation: Y = 4 + (2 * x1) +
> (3 * x2)
>
> *Data:*
> y,x1,x2
> 6.3,1,0.1
> 8.6,2,0.2
> 10.9,3,0.3
> 13.8,4,0.6
> 16.4,5,0.8
> 19.6,6,1.2
> 22.8,7,1.6
> 25.7,8,1.9
> 28.3,9,2.1
> 31.2,10,2.4
> 34.1,11,2.7
>
> *Spark Code:*
> val data = sc.textFile("Data/tempData_1.csv" )
>
> val parsedData = data.mapPartitions(_.drop(1)).map {
> line =>
> val parts = line.split(',')
> LabeledPoint(parts(0).toDouble,Vectors.dense(Array(1.0,parts(1).toDouble,parts(2).toDouble)))
>   }.cache()
>
> var numIterations = 400
> val step = 0.01
> val algorithm = new LinearRegressionWithSGD()
> algorithm.setIntercept(false) //Even tried with intercept(True) and just
> (x1,x2) features
> algorithm.optimizer.setStepSize(step)
> algorithm.optimizer.setNumIterations(numIterations)
> .setUpdater(new SimpleUpdater())
> //.setRegParam(0.1)
> .setMiniBatchFraction(1.0)
>
> val initialWeights =
> Vectors.dense(Array.fill(3)(scala.util.Random.nextDouble()))
>
> val model = algorithm.run(parsedData,initialWeights)
> println(s" Model intercept: ${model.intercept}, weights:
> ${model.weights}")
>
>
>
> Regards,
> Arun
>


Re: Solve least square problem of the form min norm(A x - b)^2 + lambda * n * norm(x)^2 ?

2015-03-09 Thread Burak Yavuz
Hi Jaonary,

The RowPartitionedMatrix is a special case of the BlockMatrix, where the
colsPerBlock = nCols. I hope that helps.
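
A hedged sketch of what that looks like in MLlib (entries is assumed to be an
RDD[MatrixEntry] describing the matrix; the rowsPerBlock value is arbitrary):

```
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val coord = new CoordinateMatrix(entries)
val nCols = coord.numCols().toInt
// one block spans all columns, so each block is a horizontal slice of rows,
// which mimics ml-matrix's RowPartitionedMatrix
val rowPartitioned = coord.toBlockMatrix(rowsPerBlock = 1024, colsPerBlock = nCols)
```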

Burak
On Mar 6, 2015 9:13 AM, "Jaonary Rabarisoa"  wrote:

> Hi Shivaram,
>
> Thank you for the link. I'm trying to figure out how I can port this to
> MLlib. Maybe you can help me understand how the pieces fit together.
> Currently, in MLlib there are different types of distributed matrix:
>
> BlockMatrix, CoordinateMatrix, IndexedRowMatrix and RowMatrix. Which one
> should correspond to RowPartitionedMatrix in ml-matrix ?
>
>
>
> On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> There are couple of solvers that I've written that is part of the AMPLab
>> ml-matrix repo [1,2]. These aren't part of MLLib yet though and if you are
>> interested in porting them I'd be happy to review it
>>
>> Thanks
>> Shivaram
>>
>>
>> [1]
>> https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
>> [2]
>> https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala
>>
>> On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa 
>> wrote:
>>
>>> Dear all,
>>>
>>> Is there a least square solver based on DistributedMatrix that we can
>>> use out of the box in the current (or the master) version of spark ?
>>> It seems that the only least square solver available in spark is private
>>> to recommender package.
>>>
>>>
>>> Cheers,
>>>
>>> Jao
>>>
>>
>>
>


Re: what are the types of tasks when running ALS iterations

2015-03-09 Thread Burak Yavuz
+user
On Mar 9, 2015 8:47 AM, "Burak Yavuz"  wrote:

> Hi,
> In the web UI, you don't see every single task. You see the name of the
> last task before the stage boundary (which is a shuffle like a groupByKey),
> which in your case is a flatMap. Therefore you only see flatMap in the UI.
> The groupByKey and the flatMap that follows form a single stage. Please
> take a look at
>
> http://www.slideshare.net/mobile/pwendell/tuning-and-debugging-in-apache-spark
> for further reference.
>
> Burak
> On Mar 8, 2015 11:44 PM, "lisendong"  wrote:
>
>> you see, the core of ALS 1.0.0 is the following code:
>> there should be flatMap and groupByKey when running ALS iterations ,
>> right?
>> but when I run als iteration, there are ONLY flatMap tasks...
>> do you know why?
>>
>>   private def updateFeatures(
>>       products: RDD[(Int, Array[Array[Double]])],
>>       productOutLinks: RDD[(Int, OutLinkBlock)],
>>       userInLinks: RDD[(Int, InLinkBlock)],
>>       partitioner: Partitioner,
>>       rank: Int,
>>       lambda: Double,
>>       alpha: Double,
>>       YtY: Option[Broadcast[DoubleMatrix]])
>>     : RDD[(Int, Array[Array[Double]])] =
>>   {
>>     val numBlocks = products.partitions.size
>>     productOutLinks.join(products).flatMap { case (bid, (outLinkBlock, factors)) =>
>>       val toSend = Array.fill(numBlocks)(new ArrayBuffer[Array[Double]])
>>       for (p <- 0 until outLinkBlock.elementIds.length; userBlock <- 0 until numBlocks) {
>>         if (outLinkBlock.shouldSend(p)(userBlock)) {
>>           toSend(userBlock) += factors(p)
>>         }
>>       }
>>       toSend.zipWithIndex.map{ case (buf, idx) => (idx, (bid, buf.toArray)) }
>>     }.groupByKey(new HashPartitioner(numBlocks)) // the ALS code in 1.0.0 has a bug here: that version used the passed-in partitioner, which has no effect and leads to data skew
>>       .join(userInLinks)
>>       .mapValues{ case (messages, inLinkBlock) =>
>>         updateBlock(messages, inLinkBlock, rank, lambda, alpha, YtY)
>>     }
>>   }
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/what-are-the-types-of-tasks-when-running-ALS-iterations-tp21966.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: How to reuse a ML trained model?

2015-03-07 Thread Burak Yavuz
Hi,

There is model import/export for some of the ML algorithms on the current
master (and they'll be shipped with the 1.3 release).
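
A hedged sketch of what that API looks like there (coverage varies by algorithm;
the path is a placeholder and model/features are assumed from your own code):

```
import org.apache.spark.mllib.classification.LogisticRegressionModel

// after training (sc is the SparkContext)
model.save(sc, "hdfs:///models/lr")

// later, possibly in a different application
val restored = LogisticRegressionModel.load(sc, "hdfs:///models/lr")
val prediction = restored.predict(features)
```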

Burak
On Mar 7, 2015 4:17 AM, "Xi Shen"  wrote:

> Wait... it seems SparkContext does not provide a way to save/load object
> files. It can only save/load RDDs. What did I miss here?
>
>
> Thanks,
> David
>
>
> On Sat, Mar 7, 2015 at 11:05 PM Xi Shen  wrote:
>
>> Ah~it is serializable. Thanks!
>>
>>
>> On Sat, Mar 7, 2015 at 10:59 PM Ekrem Aksoy  wrote:
>>
>>> You can serialize your trained model to persist somewhere.
>>>
>>> Ekrem Aksoy
>>>
>>> On Sat, Mar 7, 2015 at 12:10 PM, Xi Shen  wrote:
>>>
 Hi,

 I checked a few ML algorithms in MLLib.

 https://spark.apache.org/docs/0.8.1/api/mllib/index.html#
 org.apache.spark.mllib.classification.LogisticRegressionModel

 I could not find a way to save the trained model. Does this mean I
 have to train my model every time? Is there a more economical way to do this?

 I am thinking about something like:

 model.run(...)
 model.save("hdfs://path/to/hdfs")

 Then, next I can do:

 val model = Model.createFrom("hdfs://...")
 model.predict(vector)

 I am new to spark, maybe there are other ways to persistent the model?


 Thanks,
 David


>>>


Re: spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Burak Yavuz
Hi Koert,

Would you like to register this on spark-packages.org?

Burak

On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers  wrote:

> currently spark provides many excellent algorithms for operations per key
> as long as the data send to the reducers per key fits in memory. operations
> like combineByKey, reduceByKey and foldByKey rely on pushing the operation
> map-side so that the data reduce-side is small. and groupByKey simply
> requires that the values per key fit in memory.
>
> but there are algorithms for which we would like to process all the values
> per key reduce-side, even when they do not fit in memory. examples are
> algorithms that need to process the values ordered, or algorithms that need
> to emit all values again. basically this is what the original hadoop reduce
> operation did so well: it allowed sorting of values (using secondary sort),
> and it processed all values per key in a streaming fashion.
>
> the library spark-sorted aims to bring these kind of operations back to
> spark, by providing a way to process values with a user provided
> Ordering[V] and a user provided streaming operation Iterator[V] =>
> Iterator[W]. it does not make the assumption that the values need to fit in
> memory per key.
>
> the basic idea is to rely on spark's sort-based shuffle to re-arrange the
> data so that all values for a given key are placed consecutively within a
> single partition, and then process them using a map-like operation.
>
> you can find the project here:
> https://github.com/tresata/spark-sorted
>
> the project is in a very early stage. any feedback is very much
> appreciated.
>
>
>
>


Re: Problem getting program to run on 15TB input

2015-02-27 Thread Burak Yavuz
Hi,

Not sure if it can help, but `StorageLevel.MEMORY_AND_DISK_SER` generates
many small objects that lead to very long GC times, causing the "executor
lost", "heartbeat not received", and "GC overhead limit exceeded" messages.
Could you try using `StorageLevel.MEMORY_AND_DISK` instead? You can also
try `OFF_HEAP` (and use Tachyon).
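
Concretely, the change would be something like (names taken from your snippet below):

```
import org.apache.spark.storage.StorageLevel

val rdd1 = sc.textFile(file1)
  .persist(StorageLevel.MEMORY_AND_DISK) // instead of MEMORY_AND_DISK_SER
// or, if Tachyon is set up:
//   .persist(StorageLevel.OFF_HEAP)
```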

Burak

On Fri, Feb 27, 2015 at 11:39 AM, Arun Luthra  wrote:

> My program in pseudocode looks like this:
>
> val conf = new SparkConf().setAppName("Test")
>   .set("spark.storage.memoryFraction","0.2") // default 0.6
>   .set("spark.shuffle.memoryFraction","0.12") // default 0.2
>   .set("spark.shuffle.manager","SORT") // preferred setting for
> optimized joins
>   .set("spark.shuffle.consolidateFiles","true") // helpful for "too
> many files open"
>   .set("spark.mesos.coarse", "true") // helpful for MapOutputTracker
> errors?
>   .set("spark.akka.frameSize","500") // helpful when using
> consildateFiles=true
>   .set("spark.akka.askTimeout", "30")
>   .set("spark.shuffle.compress","false") //
> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
>   .set("spark.file.transferTo","false") //
> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
>   .set("spark.core.connection.ack.wait.timeout","600") //
> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
>   .set("spark.speculation","true")
>   .set("spark.worker.timeout","600") //
> http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
>   .set("spark.akka.timeout","300") //
> http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
>   .set("spark.storage.blockManagerSlaveTimeoutMs","12")
>   .set("spark.driver.maxResultSize","2048") // in response to error:
> Total size of serialized results of 39901 tasks (1024.0 MB) is bigger than
> spark.driver.maxResultSize (1024.0 MB)
>   .set("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer")
>   .set("spark.kryo.registrator","com.att.bdcoe.cip.ooh.MyRegistrator")
>   .set("spark.kryo.registrationRequired", "true")
>
> val rdd1 = 
> sc.textFile(file1).persist(StorageLevel.MEMORY_AND_DISK_SER).map(_.split("\\|",
> -1)...filter(...)
>
> val rdd2 =
> sc.textFile(file2).persist(StorageLevel.MEMORY_AND_DISK_SER).map(_.split("\\|",
> -1)...filter(...)
>
>
> rdd2.union(rdd1).map(...).filter(...).groupByKey().map(...).flatMap(...).saveAsTextFile()
>
>
> I run the code with:
>   --num-executors 500 \
>   --driver-memory 20g \
>   --executor-memory 20g \
>   --executor-cores 32 \
>
>
> I'm using kryo serialization on everything, including broadcast variables.
>
> Spark creates 145k tasks, and the first stage includes everything before
> groupByKey(). It fails before getting to groupByKey. I have tried doubling
> and tripling the number of partitions when calling textFile, with no
> success.
>
> Very similar code (trivial changes, to accomodate different input) worked
> on a smaller input (~8TB)... Not that it was easy to get that working.
>
>
>
> Errors vary, here is what I am getting right now:
>
> ERROR SendingConnection: Exception while reading SendingConnection
> ... java.nio.channels.ClosedChannelException
> (^ guessing that is symptom of something else)
>
> WARN BlockManagerMasterActor: Removing BlockManager
> BlockManagerId(...) with no recent heart beats: 120030ms exceeds 12ms
> (^ guessing that is symptom of something else)
>
> ERROR ActorSystemImpl: Uncaught fatal error from thread (...) shutting
> down ActorSystem [sparkDriver]
> *java.lang.OutOfMemoryError: GC overhead limit exceeded*
>
>
>
> Other times I will get messages about "executor lost..." about 1 message
> per second, after ~~50k tasks complete, until there are almost no executors
> left and progress slows to nothing.
>
> I ran with verbose GC info; I do see failing yarn containers that have
> multiple (like 30) "Full GC" messages but I don't know how to interpret if
> that is the problem. Typical Full GC time taken seems ok: [Times:
> user=23.30 sys=0.06, real=1.94 secs]
>
>
>
> Suggestions, please?
>
> Huge thanks for useful suggestions,
> Arun
>


Re: Why is RDD lookup slow?

2015-02-19 Thread Burak Yavuz
If your dataset is large, there is a Spark Package called IndexedRDD
optimized for lookups. Feel free to check that out.
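
For small tables, the broadcast approach mentioned further down is usually the
fastest; a rough sketch (smallRdd and keys are placeholders):

```
val table: Map[Int, String] = smallRdd.collect().toMap // smallRdd: RDD[(Int, String)], a few MB
val lookup = sc.broadcast(table)

val resolved = keys.map(k => (k, lookup.value.get(k))) // O(1) per key, no Spark job per lookup
```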

Burak
On Feb 19, 2015 7:37 AM, "Ilya Ganelin"  wrote:

> Hi Shahab - if your data structures are small enough a broadcasted Map is
> going to provide faster lookup. Lookup within an RDD is an O(m) operation
> where m is the size of the partition. For RDDs with multiple partitions,
> executors can operate on it in parallel so you get some improvement for
> larger RDDs.
> On Thu, Feb 19, 2015 at 7:31 AM shahab  wrote:
>
>> Hi,
>>
>> I am doing lookups on cached RDDs [(Int, String)], and I noticed that the
>> lookup is relatively slow, 30-100 ms. I even tried this on one machine
>> with a single partition, but no difference!
>>
>> The RDDs are not large at all, 3-30 MB.
>>
>> Is this expected behaviour? should I use other data structures, like
>> HashMap to keep data and look up it there and use Broadcast to send a copy
>> to all machines?
>>
>> best,
>> /Shahab
>>
>>
>>


Re: generate a random matrix with uniform distribution

2015-02-09 Thread Burak Yavuz
Sorry about that, yes, it should be uniformVectorRDD. Thanks Sean!
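
For the record, a corrected sketch (n, k, numPartitions and seed as before; note
that MLlib Vectors have no arithmetic operators, so the rescaling to (-1, 1) is
done element-wise):

```
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random.RandomRDDs

val dist = RandomRDDs.uniformVectorRDD(sc, n, k, numPartitions, seed) // entries in [0, 1)
val data = dist.map(v => Vectors.dense(v.toArray.map(_ * 2 - 1)))     // now in (-1, 1)
val matrix = new RowMatrix(data, n, k)
```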

Burak

On Mon, Feb 9, 2015 at 2:05 AM, Sean Owen  wrote:

> Yes the example given here should have used uniformVectorRDD. Then it's
> correct.
>
> On Mon, Feb 9, 2015 at 9:56 AM, Luca Puggini  wrote:
> > Thanks a lot!
> > Can I ask why this code generates a uniform distribution?
> >
> > If dist is N(0,1) data should be  N(-1, 2).
> >
> > Let me know.
> > Thanks,
> > Luca
> >
> > 2015-02-07 3:00 GMT+00:00 Burak Yavuz :
> >>
> >> Hi,
> >>
> >> You can do the following:
> >> ```
> >> import org.apache.spark.mllib.linalg.distributed.RowMatrix
> >> import org.apache.spark.mllib.random._
> >>
> >> // sc is the spark context, numPartitions is the number of partitions
> you
> >> want the RDD to be in
> >> val dist: RDD[Vector] = RandomRDDs.normalVectorRDD(sc, n, k,
> >> numPartitions, seed)
> >> // make the distribution uniform between (-1, 1)
> >> val data = dist.map(_ * 2  - 1)
> >> val matrix = new RowMatrix(data, n, k)
> >>
> >> On Feb 6, 2015 11:18 AM, "Donbeo"  wrote:
> >>>
> >>> Hi
> >>> I would like to know how can I generate a random matrix where each
> >>> element
> >>> come from a uniform distribution in -1, 1 .
> >>>
> >>> In particular I would like the matrix be a distributed row matrix with
> >>> dimension n x p
> >>>
> >>> Is this possible with mllib? Should I use another library?
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> >>>
> http://apache-spark-user-list.1001560.n3.nabble.com/generate-a-random-matrix-with-uniform-distribution-tp21538.html
> >>> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> >>>
> >>> -
> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: user-h...@spark.apache.org
> >>>
> >
>


Re: generate a random matrix with uniform distribution

2015-02-06 Thread Burak Yavuz
Hi,

You can do the following:
```
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random._

// sc is the spark context, numPartitions is the number of partitions you
want the RDD to be in
val dist: RDD[Vector] = RandomRDDs.normalVectorRDD(sc, n, k, numPartitions,
seed)
// make the distribution uniform between (-1, 1)
val data = dist.map(_ * 2  - 1)
val matrix = new RowMatrix(data, n, k)
```
On Feb 6, 2015 11:18 AM, "Donbeo"  wrote:

> Hi
> I would like to know how can I generate a random matrix where each element
> come from a uniform distribution in -1, 1 .
>
> In particular I would like the matrix be a distributed row matrix with
> dimension n x p
>
> Is this possible with mllib? Should I use another library?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/generate-a-random-matrix-with-uniform-distribution-tp21538.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: matrix of random variables with spark.

2015-02-06 Thread Burak Yavuz
Forgot to add the more recent training material:
https://databricks-training.s3.amazonaws.com/index.html

On Fri, Feb 6, 2015 at 12:12 PM, Burak Yavuz  wrote:

> Hi Luca,
>
> You can tackle this using RowMatrix (spark-shell example):
> ```
> import org.apache.spark.mllib.linalg.distributed.RowMatrix
> import org.apache.spark.mllib.random._
>
> // sc is the spark context, numPartitions is the number of partitions you
> want the RDD to be in
> val data: RDD[Vector] = RandomRDDs.normalVectorRDD(sc, n, k,
> numPartitions, seed)
> val matrix = new RowMatrix(data, n, k)
> ```
>
> You can find more tutorials here:
> https://spark-summit.org/2013/exercises/index.html
>
> Best,
> Burak
>
>
>
>
> On Fri, Feb 6, 2015 at 10:03 AM, Luca Puggini 
> wrote:
>
>> Hi all,
>> this is my first email with this mailing list and I hope that I am not
>> doing anything wrong.
>>
>> I am currently trying to define a distributed matrix with n rows and k
>> columns where each element is randomly sampled by a uniform distribution.
>> How can I do that?
>>
>> It would be also nice if you can suggest me any good guide that I can use
>> to start working with Spark. (The quick start tutorial is not enough for me
>> )
>>
>> Thanks a lot !
>>
>
>


Re: matrix of random variables with spark.

2015-02-06 Thread Burak Yavuz
Hi Luca,

You can tackle this using RowMatrix (spark-shell example):
```
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random._

// sc is the spark context, numPartitions is the number of partitions you
want the RDD to be in
val data: RDD[Vector] = RandomRDDs.normalVectorRDD(sc, n, k, numPartitions,
seed)
val matrix = new RowMatrix(data, n, k)
```

You can find more tutorials here:
https://spark-summit.org/2013/exercises/index.html

Best,
Burak




On Fri, Feb 6, 2015 at 10:03 AM, Luca Puggini  wrote:

> Hi all,
> this is my first email with this mailing list and I hope that I am not
> doing anything wrong.
>
> I am currently trying to define a distributed matrix with n rows and k
> columns where each element is randomly sampled by a uniform distribution.
> How can I do that?
>
> It would be also nice if you can suggest me any good guide that I can use
> to start working with Spark. (The quick start tutorial is not enough for me
> )
>
> Thanks a lot !
>


Re: null Error in ALS model predict

2014-12-24 Thread Burak Yavuz
Hi,

The MatrixFactorizationModel consists of two RDDs. When you use the second
method, Spark tries to serialize both RDDs for the .map() function,
which is not possible, because RDDs are not serializable. Therefore you
receive the NullPointerException. You must use the first method.
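
A minimal sketch of the first approach (data is your RDD[(int, int, double, double)]):

```
// build the (user, product) pairs and hand the whole RDD to predict
val userProducts = data.map { case (id, rubro, _, _) => (id, rubro) }
val predictions = model.predict(userProducts) // RDD[Rating]
```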

Best,
Burak

- Original Message -
From: "Franco Barrientos" 
To: user@spark.apache.org
Sent: Wednesday, December 24, 2014 7:44:24 AM
Subject: null Error in ALS model predict

Hi all!,

 

I have an RDD[(int,int,double,double)] where the first two int values are id
and product, respectively. I trained an implicit ALS algorithm and want to
make predictions from this RDD. I tried two things, but I think both ways are the
same.

 

1-  Convert this RDD to RDD[(int,int)] and use
model.predict(RDD[(int,int)]); this works for me!

2-  Make a map and apply  model.predict(int,int), for example:

val ratings = RDD[(int,int,double,double)].map{ case (id, rubro, rating, resp) =>

model.predict(id,rubro)

}

Where ratings is a RDD[Double].

 

Now, with the second way, when I apply ratings.first() I get the following error:

[error screenshot not included in the plain-text archive]

Why does this happen? I need to use this second way.

 

Thanks in advance,

 

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893

  franco.barrien...@exalitica.com 

  www.exalitica.com


   

 



Re: How can I make Spark Streaming count the words in a file in a unit test?

2014-12-08 Thread Burak Yavuz
Hi,

https://github.com/databricks/spark-perf/tree/master/streaming-tests/src/main/scala/streaming/perf
contains some performance tests for streaming. There are examples of how to 
generate synthetic files during the test in that repo, maybe you
can find some code snippets that you can use there.

Best,
Burak

- Original Message -
From: "Emre Sevinc" 
To: user@spark.apache.org
Sent: Monday, December 8, 2014 2:36:41 AM
Subject: How can I make Spark Streaming count the words in a file in a unit 
test?

Hello,

I've successfully built a very simple Spark Streaming application in Java
that is based on the HdfsCount example in Scala at
https://github.com/apache/spark/blob/branch-1.1/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala
.

When I submit this application to my local Spark, it waits for a file to be
written to a given directory, and when I create that file it successfully
prints the number of words. I terminate the application by pressing Ctrl+C.

Now I've tried to create a very basic unit test for this functionality, but
in the test I was not able to print the same information, that is the
number of words.

What am I missing?

Below is the unit test file, and after that I've also included the code
snippet that shows the countWords method:

=
StarterAppTest.java
=
import com.google.common.io.Files;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;


import org.junit.*;

import java.io.*;

public class StarterAppTest {

  JavaStreamingContext ssc;
  File tempDir;

  @Before
  public void setUp() {
ssc = new JavaStreamingContext("local", "test", new Duration(3000));
tempDir = Files.createTempDir();
tempDir.deleteOnExit();
  }

  @After
  public void tearDown() {
ssc.stop();
ssc = null;
  }

  @Test
  public void testInitialization() {
Assert.assertNotNull(ssc.sc());
  }


  @Test
  public void testCountWords() {

StarterApp starterApp = new StarterApp();

try {
  JavaDStream<String> lines =
ssc.textFileStream(tempDir.getAbsolutePath());
  JavaPairDStream<String, Integer> wordCounts =
starterApp.countWords(lines);

  System.err.println("= Word Counts ===");
  wordCounts.print();
  System.err.println("= Word Counts ===");

  ssc.start();

  File tmpFile = new File(tempDir.getAbsolutePath(), "tmp.txt");
  PrintWriter writer = new PrintWriter(tmpFile, "UTF-8");
  writer.println("8-Dec-2014: Emre Emre Emre Ergin Ergin Ergin");
  writer.close();

  System.err.println("= Word Counts ===");
  wordCounts.print();
  System.err.println("= Word Counts ===");

} catch (FileNotFoundException e) {
  e.printStackTrace();
} catch (UnsupportedEncodingException e) {
  e.printStackTrace();
}


Assert.assertTrue(true);

  }

}
=

This test compiles and starts to run, Spark Streaming prints a lot of
diagnostic messages on the console, but the calls to wordCounts.print()
do not print anything, whereas in StarterApp.java itself, they do.

I've also added ssc.awaitTermination(); after ssc.start() but nothing
changed in that respect. After that I've also tried to create a new file in
the directory that this Spark Streaming application was checking but this
time it gave an error.

For completeness, below is the wordCounts method:


public JavaPairDStream<String, Integer> countWords(JavaDStream<String> lines) {
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
  @Override
  public Iterable<String> call(String x) { return
Lists.newArrayList(SPACE.split(x)); }
});

JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
new PairFunction<String, String, Integer>() {
  @Override
  public Tuple2<String, Integer> call(String s) { return new
Tuple2<>(s, 1); }
}).reduceByKey((i1, i2) -> i1 + i2);

return wordCounts;
  }




Kind regards
Emre Sevinç


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-27 Thread Burak Yavuz
Hi,

I've come across this multiple times, but not in a consistent manner. I found 
it hard to reproduce. I have a JIRA for it: SPARK-3080.

Do you observe this error every single time? Where do you load your data from? 
Which version of Spark are you running? 
Figuring out the similarities may help in pinpointing the bug.

Thanks,
Burak

- Original Message -
From: "Ilya Ganelin" 
To: "user" 
Sent: Monday, October 27, 2014 11:36:46 AM
Subject: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

Hello all - I am attempting to run MLLib's ALS algorithm on a substantial
test vector - approx. 200 million records.

I have resolved a few issues I've had with regards to garbage collection,
KryoSeralization, and memory usage.

I have not been able to get around this issue I see below however:


> java.lang.
> ArrayIndexOutOfBoundsException: 6106
>
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.
> scala:543)
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> org.apache.spark.mllib.recommendation.ALS.org
> $apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
>
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
>
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
>
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
>
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:144)
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


I do not have any negative indices or indices that exceed Int-Max.

I have partitioned the input data into 300 partitions and my Spark config
is below:

.set("spark.executor.memory", "14g")
  .set("spark.storage.memoryFraction", "0.8")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")
  .set("spark.core.connection.ack.wait.timeout","600")
  .set("spark.akka.frameSize","50")
  .set("spark.yarn.executor.memoryOverhead","1024")

Does anyone have any suggestions as to why i'm seeing the above error or
how to get around it?
It may be possible to upgrade to the latest version of Spark but the
mechanism for doing so in our environment isn't obvious yet.

-Ilya Ganelin


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Burak Yavuz
Hi Ray,

The reduceByKey / collectAsMap step does a lot of calculation. Therefore it can
take a very long time if:
1) the number-of-runs parameter is set very high
2) k is set high (you have observed this already)
3) the data is not properly repartitioned
It seems that it is hanging, but there is a lot of calculation going on.

Did you use a different value for the number of runs?
If you look at the storage tab, does the data look balanced among executors?
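
If it helps, a rough sketch of keeping those three factors in check (values are
illustrative; vectors is your RDD of sparse vectors):

```
import org.apache.spark.mllib.clustering.KMeans

val balanced = vectors.repartition(200).cache() // roughly match the number of executors
val model = new KMeans()
  .setK(1000)
  .setRuns(1)           // every additional run multiplies the work
  .setMaxIterations(20)
  .run(balanced)
```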

Best,
Burak

- Original Message -
From: "Ray" 
To: u...@spark.incubator.apache.org
Sent: Tuesday, October 14, 2014 2:58:03 PM
Subject: Re: Spark KMeans hangs at reduceByKey / collectAsMap

Hi Xiangrui,

The input dataset has 1.5 million sparse vectors. Each sparse vector has a
dimension (cardinality) of 9153 and has fewer than 15 nonzero elements.


Yes, if I set num-executors = 200, from the hadoop cluster scheduler, I can
see the application got  201 vCores. From the spark UI, I can see it got 201
executors (as shown below).


  


 



Thanks.

Ray




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16428.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MLlib Linear Regression Mismatch

2014-10-01 Thread Burak Yavuz
Hi,

It appears that the step size is so high that the model is diverging with the
added noise.
Could you try setting the step size to 0.1 or 0.01?

Best,
Burak

- Original Message -
From: "Krishna Sankar" 
To: user@spark.apache.org
Sent: Wednesday, October 1, 2014 12:43:20 PM
Subject: MLlib Linear Regression Mismatch

Guys,
   Obviously I am doing something wrong. Maybe 4 points are too small a
dataset.
Can you help me figure out why the following doesn't work?
a) This works :

data = [
   LabeledPoint(0.0, [0.0]),
   LabeledPoint(10.0, [10.0]),
   LabeledPoint(20.0, [20.0]),
   LabeledPoint(30.0, [30.0])
]
lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
initialWeights=array([1.0]))
print lrm
print lrm.weights
print lrm.intercept
lrm.predict([40])

output:


[ 1.]
0.0

40.0

b) By perturbing the y a little bit, the model gives wrong results:

data = [
   LabeledPoint(0.0, [0.0]),
   LabeledPoint(9.0, [10.0]),
   LabeledPoint(22.0, [20.0]),
   LabeledPoint(32.0, [30.0])
]
lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
initialWeights=array([1.0])) # should be 1.09x -0.60
print lrm
print lrm.weights
print lrm.intercept
lrm.predict([40])

Output:


[ -8.20487463e+203]
0.0

-3.2819498532740317e+205

c) Same story here - wrong results. Actually nan:

data = [
   LabeledPoint(18.9, [3910.0]),
   LabeledPoint(17.0, [3860.0]),
   LabeledPoint(20.0, [4200.0]),
   LabeledPoint(16.6, [3660.0])
]
lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
initialWeights=array([1.0])) # should be ~ 0.006582x -7.595170
print lrm
print lrm.weights
print lrm.intercept
lrm.predict([4000])

Output:

[ nan]
0.0

nan

Cheers & Thanks



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark on EC2

2014-09-18 Thread Burak Yavuz
Hi Gilberto,

Could you please attach the driver logs as well, so that we can pinpoint what's 
going wrong? Could you also add the flag
`--driver-memory 4g` while submitting your application and try that as well?

Best,
Burak

- Original Message -
From: "Gilberto Lira" 
To: user@spark.apache.org
Sent: Thursday, September 18, 2014 11:48:03 AM
Subject: Spark on EC2

Hello, I am trying to run a Python script that uses the KMeans MLlib and 
I'm not getting anywhere. I'm using a c3.xlarge instance as master, and 10 
c3.large instances as slaves. In the code I map over a 600MB csv file in 
S3, where each row has 128 integer columns. The problem is that around TID 7 
my slave stops responding, and I cannot finish my processing. Could you help 
me with this problem? I am sending my script attached for review. 

Thank you, 
Gilberto 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Re: Odd error when using a rdd map within a stream map

2014-09-18 Thread Burak Yavuz
Hi,

I believe it's because you're trying to use a Function of an RDD, in an RDD, 
which is not possible. Instead of using a
`Function<JavaRDD<Float>, Void>`, could you try `Function<Float, Void>`, with
`public Void call(Float arg0) throws Exception {`
and
`System.out.println(arg0)`

instead. I'm not perfectly sure of the semantics in Java, but this should be 
what you're actually trying to do.

Best,
Burak

- Original Message -
From: "Filip Andrei" 
To: u...@spark.incubator.apache.org
Sent: Thursday, September 18, 2014 6:57:21 AM
Subject: Odd error when using a rdd map within a stream map

here i wrote a simpler version of the code to get an understanding of how it
works:

final List<NeuralNet> nns = new ArrayList<NeuralNet>();
for (int i = 0; i < numberOfNets; i++) {
    nns.add(NeuralNet.createFrom(...));
}

final JavaRDD<NeuralNet> nnRdd = sc.parallelize(nns);
JavaDStream<Float> results = rndLists.flatMap(new FlatMapFunction<Map, Float>() {
    @Override
    public Iterable<Float> call(Map input) throws Exception {

        Float f = nnRdd.map(new Function<NeuralNet, Float>() {
            @Override
            public Float call(NeuralNet nn) throws Exception {
                return 1.0f;
            }
        }).reduce(new Function2<Float, Float, Float>() {
            @Override
            public Float call(Float left, Float right) throws Exception {
                return left + right;
            }
        });

        return Arrays.asList(f);
    }
});
results.print();
results.print();


This works as expected and print() simply shows the number of neural nets I
have.
If instead of print() I use

results.foreach(new Function<JavaRDD<Float>, Void>() {

    @Override
    public Void call(JavaRDD<Float> arg0) throws Exception {

        for (Float f : arg0.collect()) {
            System.out.println(f);
        }
        return null;
    }
});

It fails with the following exception
org.apache.spark.SparkException: Job aborted due to stage failure: Task
1.0:0 failed 1 times, most recent failure: Exception failure in TID 1 on
host localhost: java.lang.NullPointerException 
org.apache.spark.rdd.RDD.map(RDD.scala:270)

This is weird to me since the same code executes as expected in one case and
doesn't in the other. Any idea what's going on here?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Odd-error-when-using-a-rdd-map-within-a-stream-map-tp14551.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Re: Python version of kmeans

2014-09-17 Thread Burak Yavuz
Hi,

spark-1.0.1/examples/src/main/python/kmeans.py => Naive example for users to 
understand how to code in Spark
spark-1.0.1/python/pyspark/mllib/clustering.py => Use this!!!

Bonus: spark-1.0.1/examples/src/main/python/mllib/kmeans.py => Example on how 
to call KMeans. Feel free to use it as a template!

Best,
Burak

- Original Message -
From: "MEETHU MATHEW" 
To: user@spark.apache.org
Sent: Wednesday, September 17, 2014 10:26:40 PM
Subject: Python version of kmeans

Hi all,

I need the kmeans code written against Pyspark for some testing purpose.
Can somebody tell me the difference between these two files.

 spark-1.0.1/examples/src/main/python/kmeans.py   and 

 spark-1.0.1/python/pyspark/mllib/clustering.py


Thanks & Regards, 
Meethu M


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MLLib: LIBSVM issue

2014-09-17 Thread Burak Yavuz
Hi,

The separator between the entries should be a single space, not a tab. It looks like 
your inputs are tab-separated instead of space-separated, so the parser 
cannot parse them.
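
For illustration, one way to normalize the separators before loading (a sketch, untested; the paths are placeholders):

val cleaned = sc.textFile("data/raw_libsvm.txt")
  .map(_.trim.split("\\s+").mkString(" "))   // collapse tabs / repeated whitespace into single spaces
cleaned.saveAsTextFile("data/clean_libsvm")

val examples = MLUtils.loadLibSVMFile(sc, "data/clean_libsvm")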

Best,
Burak

- Original Message -
From: "Sameer Tilak" 
To: user@spark.apache.org
Sent: Wednesday, September 17, 2014 7:25:10 PM
Subject: MLLib: LIBSVM issue

Hi All,

We have a fairly large amount of sparse data. I was following these instructions 
in the manual:

Sparse data: It is very common in practice to have sparse training data. MLlib 
supports reading training examples stored in LIBSVM format, which is the 
default format used by LIBSVM and LIBLINEAR. It is a text format in which each 
line represents a labeled sparse feature vector using the following 
format: label index1:value1 index2:value2 ...

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_libsvm_data.txt")

I believe that I have formatted my data as per the required LIBSVM format. Here 
is a snippet of that:
1122:11693:11771:11974:12334:1
2378:12562:1 1118:11389:11413:11454:1   
 1780:12562:15051:15417:15548:1
5798:15862:1 0150:1214:1468:11013:1 
   1078:11092:11117:11489:11546:11630:1 
   1635:11827:12024:12215:12478:1
2761:15985:16115:16218:1 0251:15578:1 
However, when I use MLUtils.loadLibSVMFile(sc, "path-to-data-file") I get the 
following error messages in my spark-shell. Can someone please point me in the 
right direction?
java.lang.NumberFormatException: For input string: "150:1214:1
468:11013:11078:11092:11117:11489:1 
   1546:11630:11635:11827:12024:12215:1 
   2478:12761:15985:16115:16218:1" 
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241) 
at java.lang.Double.parseDouble(Double.java:540) at 
scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)  
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Size exceeds Integer.MAX_VALUE in BlockFetcherIterator

2014-09-17 Thread Burak Yavuz
Hi,

Could you try repartitioning the data with .repartition(# of cores on the machine), or, 
while reading the data, supplying the minimum number of partitions, as in
sc.textFile(path, # of cores on the machine)?

It may be that the whole dataset is stored in a single block. With billions of 
rows, a single block can exceed 2 GB, and the byte-level indexing then fails with 
the "Size exceeds Integer.MAX_VALUE" error.
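
For example (a sketch; the path and partition count are placeholders):

// ask for more partitions up front when reading...
val data = sc.textFile("hdfs:///path/to/rows", 64)
// ...or split an already-loaded RDD into smaller blocks
val repartitioned = data.repartition(64)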

Best,
Burak

- Original Message -
From: "francisco" 
To: u...@spark.incubator.apache.org
Sent: Wednesday, September 17, 2014 3:18:29 PM
Subject: Size exceeds Integer.MAX_VALUE in BlockFetcherIterator

Hi,

We are running aggregation on a huge data set (few billion rows).
While running the task got the following error (see below). Any ideas?
Running spark 1.1.0 on cdh distribution.

...
14/09/17 13:33:30 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0).
2083 bytes result sent to driver
14/09/17 13:33:30 INFO CoarseGrainedExecutorBackend: Got assigned task 1
14/09/17 13:33:30 INFO Executor: Running task 0.0 in stage 2.0 (TID 1)
14/09/17 13:33:30 INFO TorrentBroadcast: Started reading broadcast variable
2
14/09/17 13:33:30 INFO MemoryStore: ensureFreeSpace(1428) called with
curMem=163719, maxMem=34451478282
14/09/17 13:33:30 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes
in memory (estimated size 1428.0 B, free 32.1 GB)
14/09/17 13:33:30 INFO BlockManagerMaster: Updated info of block
broadcast_2_piece0
14/09/17 13:33:30 INFO TorrentBroadcast: Reading broadcast variable 2 took
0.027374294 s
14/09/17 13:33:30 INFO MemoryStore: ensureFreeSpace(2336) called with
curMem=165147, maxMem=34451478282
14/09/17 13:33:30 INFO MemoryStore: Block broadcast_2 stored as values in
memory (estimated size 2.3 KB, free 32.1 GB)
14/09/17 13:33:30 INFO MapOutputTrackerWorker: Updating epoch to 1 and
clearing cache
14/09/17 13:33:30 INFO MapOutputTrackerWorker: Don't have map outputs for
shuffle 1, fetching them
14/09/17 13:33:30 INFO MapOutputTrackerWorker: Doing the fetch; tracker
actor =
Actor[akka.tcp://sparkdri...@sas-model1.pv.sv.nextag.com:56631/user/MapOutputTracker#794212052]
14/09/17 13:33:30 INFO MapOutputTrackerWorker: Got the output locations
14/09/17 13:33:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/09/17 13:33:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
Getting 1 non-empty blocks out of 1 blocks
14/09/17 13:33:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
Started 0 remote fetches in 8 ms
14/09/17 13:33:30 ERROR BlockFetcherIterator$BasicBlockFetcherIterator:
Error occurred while fetching local blocks
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:104)
at org.apache.spark.storage.DiskStore.getValues(DiskStore.scala:120)
at
org.apache.spark.storage.BlockManager.getLocalFromDisk(BlockManager.scala:358)
at
org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:208)
at
org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:205)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator.getLocalBlocks(BlockFetcherIterator.scala:205)
at
org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator.initialize(BlockFetcherIterator.scala:240)
at
org.apache.spark.storage.BlockManager.getMultiple(BlockManager.scala:583)
at
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:77)
at
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:41)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
14/09/17 13:33:30 INFO CoarseGrainedExecuto

Re: How to run kmeans after pca?

2014-09-17 Thread Burak Yavuz
To properly perform PCA, you must multiply the original RowMatrix by the DenseMatrix of 
principal components. The result will also be a RowMatrix,
so you can easily access its rows via .rows and train KMeans on them.

Don't forget to Broadcast the DenseMatrix returned from 
RowMatrix.computePrincipalComponents(), otherwise you'll get an OOME.

Here's how to do it in Scala: (didn't run the code, but should be something 
like this)

val data: RowMatrix = ...
// k = number of principal components to keep
val pc: Matrix = data.computePrincipalComponents(k)
val bcPrincipalComponents = data.rows.context.broadcast(pc)

val newData: RowMatrix = data.multiply(bcPrincipalComponents.value)

KMeans.train(newData.rows, numClusters, numIterations)

Best,
Burak



- Original Message -
From: "st553" 
To: u...@spark.incubator.apache.org
Sent: Wednesday, September 17, 2014 12:21:38 PM
Subject: How to run kmeans after pca?

I would like to reduce the dimensionality of my data before running kmeans.
The problem I'm having is that both RowMatrix.computePrincipalComponents()
and RowMatrix.computeSVD() return a DenseMatrix whereas KMeans.train()
requires an RDD[Vector]. Does MLlib provide a way to do this conversion?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-kmeans-after-pca-tp14473.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Re: Spark and disk usage.

2014-09-17 Thread Burak Yavuz
Yes, writing to HDFS is more expensive, but I feel it is still a small price to 
pay when compared to having a Disk Space Full error three hours in
and having to start from scratch.

The main goal of checkpointing is to truncate the lineage. Clearing up shuffle 
writes comes as a bonus to checkpointing; it is not the main goal. The 
subtlety here is that .checkpoint() is just like .cache(). Until you call an 
action, nothing happens. Therefore, if you're going to do 1000 maps in a 
row and you don't want to checkpoint in the meantime until a shuffle happens, 
you will still get a StackOverflowError, because the lineage is too long.

I went through some of the code for checkpointing. As far as I can tell, it 
materializes the data in HDFS, and resets all its dependencies, so you start 
a fresh lineage. My understanding would be that checkpointing still should be 
done every N operations to reset the lineage. However, an action must be 
performed before the lineage grows too long.
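
A minimal sketch of that pattern (the paths, the transformation, and the interval are all placeholders):

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

var rdd = sc.textFile("hdfs:///path/to/input")
for (i <- 1 to 1000) {
  rdd = rdd.map(line => line.trim)   // stand-in for a long chain of transformations
  if (i % 50 == 0) {
    rdd.cache()
    rdd.checkpoint()                 // only marks the RDD for checkpointing
    rdd.count()                      // an action materializes it and truncates the lineage
  }
}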

I believe it would be nice to write up checkpointing in the programming guide. 
The reason that it's not there yet I believe is that most applications don't
grow such a long lineage, except in Spark Streaming, and some MLlib algorithms. 
If you can help with the guide, I think it would be a nice feature to have!

Burak


- Original Message -
From: "Andrew Ash" 
To: "Burak Yavuz" 
Cc: "Макар Красноперов" , "user" 

Sent: Wednesday, September 17, 2014 11:04:02 AM
Subject: Re: Spark and disk usage.

Thanks for the info!

Are there performance impacts with writing to HDFS instead of local disk?
 I'm assuming that's why ALS checkpoints every third iteration instead of
every iteration.

Also I can imagine that checkpointing should be done every N shuffles
instead of every N operations (counting maps), since only the shuffle
leaves data on disk.  Do you have any suggestions on this?

We should write up some guidance on the use of checkpointing in the programming
guide <https://spark.apache.org/docs/latest/programming-guide.html> - I can
help with this

Andrew


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark and disk usage.

2014-09-17 Thread Burak Yavuz
Hi Andrew,

Yes, I'm referring to sparkContext.setCheckpointDir(). It has the same effect 
as in Spark Streaming. 
For example, in an algorithm like ALS, the RDDs go through many transformations 
and the lineage of the RDD starts to grow drastically just like 
the lineage of DStreams do in Spark Streaming. You may observe 
StackOverflowErrors in ALS if you set the number of iterations to be very high. 

If you set the checkpointing directory however, the intermediate state of the 
RDDs will be saved in HDFS, and the lineage will pick off from there. 
You won't need to keep the shuffle data before the checkpointed state, 
therefore those can be safely removed (will be removed automatically).
However, checkpoint must be called explicitly as in 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291
 ,just setting the directory will not be enough.

Best,
Burak

- Original Message -
From: "Andrew Ash" 
To: "Burak Yavuz" 
Cc: "Макар Красноперов" , "user" 

Sent: Wednesday, September 17, 2014 10:19:42 AM
Subject: Re: Spark and disk usage.

Hi Burak,

Most discussions of checkpointing in the docs is related to Spark
streaming.  Are you talking about the sparkContext.setCheckpointDir()?
 What effect does that have?

https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing

On Wed, Sep 17, 2014 at 7:44 AM, Burak Yavuz  wrote:

> Hi,
>
> The files you mentioned are temporary files written by Spark during
> shuffling. ALS will write a LOT of those files as it is a shuffle heavy
> algorithm.
> Those files will be deleted after your program completes as Spark looks
> for those files in case a fault occurs. Having those files ready allows
> Spark to
> continue from the stage the shuffle left off, instead of starting from the
> very beginning.
>
> Long story short, it's to your benefit that Spark writes those files to
> disk. If you don't want Spark writing to disk, you can specify a checkpoint
> directory in
> HDFS, where Spark will write the current status instead and will clean up
> files from disk.
>
> Best,
> Burak
>
> - Original Message -
> From: "Макар Красноперов" 
> To: user@spark.apache.org
> Sent: Wednesday, September 17, 2014 7:37:49 AM
> Subject: Spark and disk usage.
>
> Hello everyone.
>
> The problem is that spark write data to the disk very hard, even if
> application has a lot of free memory (about 3.8g).
> So, I've noticed that folder with name like
> "spark-local-20140917165839-f58c" contains a lot of other folders with
> files like "shuffle_446_0_1". The total size of files in the dir
> "spark-local-20140917165839-f58c" can reach 1.1g.
> Sometimes its size decreases (are there only temp files in that folder?),
> so the totally amount of data written to the disk is greater than 1.1g.
>
> The question is what kind of data Spark store there and can I make spark
> not to write it on the disk and just keep it in the memory if there is
> enough RAM free space?
>
> I run my job locally with Spark 1.0.1:
> ./bin/spark-submit --driver-memory 12g --master local[3] --properties-file
> conf/spark-defaults.conf --class my.company.Main /path/to/jar/myJob.jar
>
> spark-defaults.conf :
> spark.shuffle.spill false
> spark.reducer.maxMbInFlight 1024
> spark.shuffle.file.buffer.kb2048
> spark.storage.memoryFraction0.7
>
> The situation with disk usage is common for many jobs. I had also used ALS
> from MLIB and saw the similar things.
>
> I had reached no success by playing with spark configuration and i hope
> someone can help me :)
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark and disk usage.

2014-09-17 Thread Burak Yavuz
Hi,

The files you mentioned are temporary files written by Spark during shuffling. 
ALS will write a LOT of those files as it is a shuffle heavy algorithm.
Those files will be deleted after your program completes as Spark looks for 
those files in case a fault occurs. Having those files ready allows Spark to 
continue from the stage the shuffle left off, instead of starting from the very 
beginning.

Long story short, it's to your benefit that Spark writes those files to disk. 
If you don't want Spark writing to disk, you can specify a checkpoint directory 
in
HDFS, where Spark will write the current status instead and will clean up files 
from disk.

Best,
Burak

- Original Message -
From: "Макар Красноперов" 
To: user@spark.apache.org
Sent: Wednesday, September 17, 2014 7:37:49 AM
Subject: Spark and disk usage.

Hello everyone.

The problem is that Spark writes data to disk very heavily, even when the
application has a lot of free memory (about 3.8g).
I've noticed that a folder with a name like
"spark-local-20140917165839-f58c" contains a lot of other folders with
files like "shuffle_446_0_1". The total size of the files in the dir
"spark-local-20140917165839-f58c" can reach 1.1g.
Sometimes its size decreases (are there only temp files in that folder?),
so the total amount of data written to disk is greater than 1.1g.

The question is: what kind of data does Spark store there, and can I make Spark
keep it in memory instead of writing it to disk when there is
enough free RAM?

I run my job locally with Spark 1.0.1:
./bin/spark-submit --driver-memory 12g --master local[3] --properties-file
conf/spark-defaults.conf --class my.company.Main /path/to/jar/myJob.jar

spark-defaults.conf :
spark.shuffle.spill false
spark.reducer.maxMbInFlight 1024
spark.shuffle.file.buffer.kb    2048
spark.storage.memoryFraction    0.7

This disk usage pattern is common across many jobs. I have also used ALS
from MLlib and saw similar behavior.

I have had no success playing with the Spark configuration, and I hope
someone can help me :)


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL

2014-09-14 Thread Burak Yavuz
Hi,

I'm not a master on SparkSQL, but from what I understand, the problem is that 
you're trying to access an RDD
inside an RDD here: val xyz = file.map(line => *** 
extractCurRate(sqlContext.sql("select rate ... *** and 
here:  xyz = file.map(line => *** extractCurRate(sqlContext.sql("select rate 
... ***.
RDDs can't be serialized inside other RDD tasks, therefore you're receiving the 
NullPointerException.

More specifically, you are trying to generate a SchemaRDD inside an RDD, which 
you can't do.

If the file isn't huge, you can call .collect() to turn the RDD into an array 
and then use .map() on the array.

If the file is huge, then you may do number 3 first, join the two RDDs using 
'txCurCode' as a key, and then do filtering
operations, etc...
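
As an illustration of the join approach, a rough sketch (untested; the column positions come from the original message below, everything else is a placeholder):

// (txCurCode, rate) pairs from the SQL side
val rates = sqlContext.sql("select txCurCode, rate from CurrencyCodeRates")
  .map(row => (row.getString(0), row.getDouble(1)))

// key each input line by its txCurCode substring
val linesByCode = file.map(line => (line.substring(202, 205), line))

// join on txCurCode, then filter/aggregate as needed
val joined = linesByCode.join(rates)   // RDD[(String, (String, Double))]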

Best,
Burak

- Original Message -
From: "rkishore999" 
To: u...@spark.incubator.apache.org
Sent: Saturday, September 13, 2014 10:29:26 PM
Subject: Spark SQL

val file =
sc.textFile("hdfs://ec2-54-164-243-97.compute-1.amazonaws.com:9010/user/fin/events.txt")

1. val xyz = file.map(line => extractCurRate(sqlContext.sql("select rate
from CurrencyCodeRates where txCurCode = '" + line.substring(202,205) + "'
and fxCurCode = '" + fxCurCodesMap(line.substring(77,82)) + "' and
effectiveDate >= '" + line.substring(221,229) + "' order by effectiveDate
desc"))

2. val xyz = file.map(line => sqlContext.sql("select rate, txCurCode,
fxCurCode, effectiveDate from CurrencyCodeRates where txCurCode = 'USD' and
fxCurCode = 'CSD' and effectiveDate >= '20140901' order by effectiveDate
desc"))

3. val xyz = sqlContext.sql("select rate, txCurCode, fxCurCode,
effectiveDate from CurrencyCodeRates where txCurCode = 'USD' and fxCurCode =
'CSD' and effectiveDate >= '20140901' order by effectiveDate desc")

xyz.saveAsTextFile("/user/output")

In statements 1 and 2 I'm getting a NullPointerException, but statement 3 is
fine. I'm guessing the Spark context and SQL context are not playing together
well.

Any suggestions regarding how I can achieve this?






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-tp14183.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Re: Filter function problem

2014-09-09 Thread Burak Yavuz
Hi,

val test = persons.value
  .map{tuple => (tuple._1, tuple._2
  .filter{event => *inactiveIDs.filter(event2 => event2._1 == 
tuple._1).count() != 0})}

Your problem is right between the asterisks. You can't run an RDD operation 
inside another RDD operation, because RDDs can't be serialized. 
Therefore you are receiving the NullPointerException. Try joining the RDDs 
based on `event` and then filter based on that.
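
A rough sketch of that join-then-filter approach (the types here are assumptions, since they aren't shown in the thread):

// assuming persons: RDD[(Id, Seq[Event])] and inactiveIDs: RDD[(Id, _)]
val inactiveKeys = inactiveIDs.map { case (id, _) => (id, ()) }.distinct()

val test = persons
  .join(inactiveKeys)   // keeps only entries whose id appears in inactiveIDs
  .map { case (id, (events, _)) => (id, events) }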

Best,
Burak

- Original Message -
From: "Blackeye" 
To: u...@spark.incubator.apache.org
Sent: Tuesday, September 9, 2014 3:34:58 AM
Subject: Re: Filter function problem

To help anyone answer, I can say that I checked the
inactiveIDs.filter operation separately, and I found that it doesn't return
null in any case. In addition, I don't know how to handle (or check) whether an RDD
is null. I find the debugging too complicated to pinpoint the error. Any ideas
on how to find the null pointer? 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Filter-function-problem-tp13787p13789.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Re: Executor address issue: "CANNOT FIND ADDRESS" (Spark 0.9.1)

2014-09-08 Thread Burak Yavuz
Hi Nicolas,

It seems that you are starting to lose executors and then the job starts to 
fail. Can you please share more information about your application
so that we can help you debug it, such as what you're trying to do, and your 
driver logs please?

Best,
Burak

- Original Message -
From: "Nicolas Mai" 
To: u...@spark.incubator.apache.org
Sent: Monday, September 8, 2014 5:51:23 PM
Subject: Executor address issue: "CANNOT FIND ADDRESS" (Spark 0.9.1)

Hi,
One of the executors in my spark cluster shows a "CANNOT FIND ADDRESS"
address, for one of the stages which failed. After that stage, I got
cascading failures for all my stages :/ (stages that seem complete but still
appear as active stages in the dashboard; incomplete or failed stages that
are still in the active sections). Just a note that in the later stages,
there were no more "CANNOT FIND ADDRESS" issues.

Did anybody get this address issue and find a solution? Could this problem
explain the cascading failures?

Thanks!
Nicolas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Executor-address-issue-CANNOT-FIND-ADDRESS-Spark-0-9-1-tp13748.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org






Re: Memory statistics in the Application detail UI

2014-08-28 Thread Burak Yavuz
Hi,

By default, Spark uses approximately 60% of the executor heap memory to store 
RDDs. That's why you see 8.6 GB instead of 16 GB. The 95.5 GB is the sum of 
the 8.6 GB of storage memory across all the executors, plus the driver's.
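
A rough back-of-the-envelope check (this assumes the 1.x defaults of spark.storage.memoryFraction = 0.6 and spark.storage.safetyFraction = 0.9; check your own configuration):

val executorHeapGb = 16.0
val storageGb = executorHeapGb * 0.6 * 0.9   // ~8.6 GB reported per executor in the UI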

Best,
Burak

- Original Message -
From: "SK" 
To: u...@spark.incubator.apache.org
Sent: Thursday, August 28, 2014 6:32:32 PM
Subject: Memory statistics in the Application detail UI

Hi,

I am using a cluster where each node has 16GB (this is the executor memory).
After I complete an MLlib job, the executor tab shows the following:

Memory: 142.6 KB Used (95.5 GB Total) 

and individual worker nodes have the Memory Used values as 17.3 KB / 8.6 GB 
(this is different for different nodes). What does the second number signify
(i.e.  8.6 GB and 95.5 GB)? If 17.3 KB was used out of the total memory of
the node, should it not be 17.3 KB/16 GB?

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Memory-statistics-in-the-Application-detail-UI-tp13082.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org






Re: OutofMemoryError when generating output

2014-08-28 Thread Burak Yavuz
Yeah, saveAsTextFile is an RDD-specific method. If you really want to use that 
method, just turn the map into an RDD:

`sc.parallelize(x.toSeq).saveAsTextFile(...)`

Reading through the API docs will show you many more alternative solutions!

Best,
Burak

- Original Message -
From: "SK" 
To: u...@spark.incubator.apache.org
Sent: Thursday, August 28, 2014 12:45:22 PM
Subject: Re: OutofMemoryError when generating output

Hi,
Thanks for the response. I tried to use countByKey. But I am not able to
write the output to console or to a file. Neither collect() nor
saveAsTextFile() work for the Map object that is generated after
countByKey(). 

val x = sc.textFile(baseFile).map { line =>
  val fields = line.split("\t")
  (fields(11), fields(6)) // extract (month, user_id)
}.distinct().countByKey()

x.saveAsTextFile(...)  // does not work; generates an error that
saveAsTextFile is not defined for a Map object


Is there a way to convert the Map object to an object that I can output to
console and to a file?

thanks




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/OutofMemoryError-when-generating-output-tp12847p13056.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Re: Amplab: big-data-benchmark

2014-08-27 Thread Burak Yavuz
Hi Sameer,

I've faced this issue before. They don't show up on 
http://s3.amazonaws.com/big-data-benchmark/. But you can directly use: 
`sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")`
The gotcha is that you also need to supply which dataset you want: crawl, 
uservisits, or rankings in lower case after the format and size you want them 
in.
They should be there.

Best,
Burak

- Original Message -
From: "Sameer Tilak" 
To: user@spark.apache.org
Sent: Wednesday, August 27, 2014 11:42:28 AM
Subject: Amplab: big-data-benchmark

Hi All,
I am planning to run the AMPLab benchmark suite to evaluate the performance of our 
cluster. I looked at https://amplab.cs.berkeley.edu/benchmark/ and it mentions 
data availability at:
s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix]
where /tiny/, /1node/ and /5nodes/ are options for the suffix. However, I am not able to 
download these datasets directly. I read that they can be used directly by doing 
sc.textFile(s3:/). However, I wanted to make sure that my understanding is correct. 
Here is what I see at http://s3.amazonaws.com/big-data-benchmark/:
I do not see anything for sequence or text-deflate.
I see sequence-snappy dataset:
pavlo/sequence-snappy/5nodes/crawl/000738_02013-05-27T21:26:40.000Z"a978d18721d5a533d38a88f558461644"42958735STANDARD
For text, I get the following error:
NoSuchKeyThe specified key does not 
exist.pavlo/text/1node/crawl166D239D383995264Bg8BHomWqJ6BXOkx/3fQZhN5Uw1TtCn01uQzm+1qYffx2s/oPV+9sGoAWV2thCI

Please let me know if there is a way to readily download the dataset and view 
it. 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile hangs with hdfs

2014-08-26 Thread Burak Yavuz
Hi David, 

Your job is probably hanging on the groupByKey step. Either GC is kicking 
in and the process starts to hang, or the data is unbalanced and you end up with 
stragglers (once GC kicks in, you'll start to get the connection errors you 
shared). If you don't care about the list of values itself, but only its count 
(that appears to be what you're trying to save, correct me if I'm wrong), then 
I would suggest using `countByKey()` directly on 
`JavaPairRDD<String, AnalyticsLogFlyweight> partitioned`. 

Best, 
Burak 

- Original Message -

From: "David"  
To: "user"  
Sent: Tuesday, August 19, 2014 1:44:18 PM 
Subject: saveAsTextFile hangs with hdfs 

I have a simple spark job that seems to hang when saving to hdfs. When looking 
at the spark web ui, the job reached 97 of 100 tasks completed. I need some 
help determining why the job appears to hang. The job hangs on the 
"saveAsTextFile()" call. 


https://www.dropbox.com/s/fdp7ck91hhm9w68/Screenshot%202014-08-19%2010.53.24.png
 

The job is pretty simple: 

JavaRDD<String> analyticsLogs = context 
    .textFile(Joiner.on(",").join(hdfs.glob("/spark-dfs", ".*\\.log$")), 4); 

JavaRDD<AnalyticsLogFlyweight> flyweights = analyticsLogs 
    .map(line -> { 
        try { 
            AnalyticsLog log = GSON.fromJson(line, AnalyticsLog.class); 
            AnalyticsLogFlyweight flyweight = new AnalyticsLogFlyweight(); 
            flyweight.ipAddress = log.getIpAddress(); 
            flyweight.time = log.getTime(); 
            flyweight.trackingId = log.getTrackingId(); 
            return flyweight; 
        } catch (Exception e) { 
            LOG.error("error parsing json", e); 
            return null; 
        } 
    }); 

JavaRDD<AnalyticsLogFlyweight> filtered = flyweights 
    .filter(log -> log != null); 

JavaPairRDD<String, AnalyticsLogFlyweight> partitioned = filtered 
    .mapToPair((AnalyticsLogFlyweight log) -> new Tuple2<>(log.trackingId, log)) 
    .partitionBy(new HashPartitioner(100)).cache(); 

Ordering<AnalyticsLogFlyweight> ordering = 
    Ordering.natural().nullsFirst().onResultOf(new Function<AnalyticsLogFlyweight, Long>() { 
        public Long apply(AnalyticsLogFlyweight log) { 
            return log.time; 
        } 
    }); 

JavaPairRDD<String, Iterable<AnalyticsLogFlyweight>> stringIterableJavaPairRDD 
    = partitioned.groupByKey(); 
JavaPairRDD<String, Integer> stringIntegerJavaPairRDD = 
    stringIterableJavaPairRDD.mapToPair((log) -> { 
        List<AnalyticsLogFlyweight> sorted = Lists.newArrayList(log._2()); 
        sorted.forEach(l -> LOG.info("sorted {}", l)); 
        return new Tuple2<>(log._1(), sorted.size()); 
    }); 

String outputPath = "/summarized/groupedByTrackingId4"; 
hdfs.rm(outputPath, true); 
stringIntegerJavaPairRDD.saveAsTextFile(String.format("%s/%s", hdfs.getUrl(), outputPath)); 


Thanks in advance, David 



Re: OutofMemoryError when generating output

2014-08-26 Thread Burak Yavuz
Hi,

The error doesn't occur during saveAsTextFile but rather during the groupByKey, 
as far as I can tell. We strongly urge users not to use groupByKey
if they don't have to. What I would suggest is the following work-around:

sc.textFile(baseFile).map { line =>
  val fields = line.split("\t")
  (fields(11), fields(6)) // extract (month, user_id)
}.distinct().countByKey()

instead

Best,
Burak


- Original Message -
From: "SK" 
To: u...@spark.incubator.apache.org
Sent: Tuesday, August 26, 2014 12:38:00 PM
Subject: OutofMemoryError when generating output

Hi,

I have the following piece of code that I am running on a cluster with 10
nodes with 2GB memory per node. The tasks seem to complete, but at the point
where it is generating output (saveAsTextFile), the program freezes after
some time and reports an out of memory error (error transcript attached
below). I also tried using collect() and printing the output to console
instead of a file, but got the same error. The program reads some logs for a
month and extracts the number of unique users during the month. The reduced
output is not very large, so not sure why the memory error occurs. I would
appreciate any help in fixing this memory error to get the output. Thanks.

def main(args: Array[String]) {

  val conf = new SparkConf().setAppName("App")
  val sc = new SparkContext(conf)

  // get the number of users per month
  val user_time = sc.union(sc.textFile(baseFile))
    .map(line => {
      val fields = line.split("\t")
      (fields(11), fields(6))            // extract (month, user_id)
    })
    .groupByKey                          // group by month as the key
    .map(g => (g._1, g._2.toSet.size))   // get the unique id count per month
  // .collect()
  // user_time.foreach(f => println(f))

  user_time.map(f => "%s, %s".format(f._1, f._2)).saveAsTextFile("app_output")
  sc.stop()
}






14/08/26 15:21:15 WARN TaskSetManager: Loss was due to
java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: GC overhead limit exceeded
at
org.apache.spark.util.collection.ExternalAppendOnlyMap.insert(ExternalAppendOnlyMap.scala:121)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:60)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$4.apply(PairRDDFunctions.scala:107)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$4.apply(PairRDDFunctions.scala:106)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at 
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/OutofMemoryError-when-generating-output-tp12847.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Re: Finding Rank in Spark

2014-08-23 Thread Burak Yavuz
Spearman's Correlation requires the calculation of ranks for columns. You can 
checkout the code here and slice the part you need!

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala
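
For illustration, a minimal sketch of assigning ranks in parallel (run e.g. in the spark-shell; it ignores ties, which the Spearman code linked above resolves by averaging):

import org.apache.spark.rdd.RDD

val values: RDD[Double] = sc.parallelize(Seq(3.0, 1.0, 4.0, 1.5, 9.0))

// sortBy gives a globally sorted RDD; zipWithIndex then assigns 0-based positions
// without collecting to the driver, so the ranks are consistent across partitions
val ranks: RDD[(Double, Long)] = values.sortBy(identity).zipWithIndex()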

Best,
Burak

- Original Message -
From: "athiradas" 
To: u...@spark.incubator.apache.org
Sent: Friday, August 22, 2014 4:14:34 AM
Subject: Re: Finding Rank in Spark

Does anyone know a way to do this?

I tried sorting it and writing an auto-increment function,
but since the computation is parallel, the result is wrong.

Is there any way? Please reply.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-Rank-in-Spark-tp12028p12647.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Re: LDA example?

2014-08-22 Thread Burak Yavuz
You can check out this pull request: https://github.com/apache/spark/pull/476

LDA is on the roadmap for the 1.2 release, hopefully we will officially support 
it then!

Best,
Burak

- Original Message -
From: "Denny Lee" 
To: user@spark.apache.org
Sent: Thursday, August 21, 2014 10:10:35 PM
Subject: LDA example?

Quick question - is there a handy sample / example of how to use the LDA 
algorithm within Spark MLLib?  

Thanks!
Denny



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [MLLib]:choosing the Loss function

2014-08-11 Thread Burak Yavuz
Hi,

// Initialize the optimizer using logistic regression as the loss function with 
L2 regularization
val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())

// Set the hyperparameters
lbfgs.setMaxNumIterations(numIterations).setRegParam(regParam).setConvergenceTol(tol).setNumCorrections(numCor)

// Retrieve the weights
val weightsWithIntercept = lbfgs.optimize(data, initialWeights)

//Slice weights with intercept into weight and intercept

//Initialize Logistic Regression Model
val model = new LogisticRegressionModel(weights, intercept)

model.predict(test) //Make your predictions

The example code doesn't generate the Logistic Regression Model that you can 
make predictions with.

`LBFGS.runMiniBatchLBFGS` outputs a tuple of (weights, lossHistory). The 
example code was for a benchmark, so it was more interested
in the loss history than in the model itself.

You can also run
`val (weightsWithIntercept, localLoss) = LBFGS.runMiniBatchLBFGS ...`

slice `weightsWithIntercept` into the intercept and the rest of the weights and 
instantiate the model again as:
val model = new LogisticRegressionModel(weights, intercept)
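
Concretely, the slicing might look like this (a sketch; it assumes the intercept weight sits in the last position, as MLUtils.appendBias produces, so adjust the indexing if your data differs):

import org.apache.spark.mllib.linalg.Vectors

val raw = weightsWithIntercept.toArray
val weights = Vectors.dense(raw.dropRight(1))
val intercept = raw.last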


Burak



- Original Message -
From: "SK" 
To: u...@spark.incubator.apache.org
Sent: Monday, August 11, 2014 11:52:04 AM
Subject: Re: [MLLib]:choosing the Loss function

Hi,

Thanks for the reference to the LBFGS optimizer. 
I tried to use the LBFGS optimizer, but I am not able to pass it  as an
input to the LogisticRegression model for binary classification. After
studying the code in mllib/classification/LogisticRegression.scala, it
appears that the  only implementation of LogisticRegression uses
GradientDescent as a fixed optimizer. In other words, I dont see a
setOptimizer() function that I can use to change the optimizer to LBFGS.

I tried to follow the code in
https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala
that makes use of LBFGS, but it is not clear to me where  the
LogisticRegression  model with LBFGS is being returned that I can use for
the classification of the test dataset. 

If someone has sample code that uses LogisticRegression with LBFGS instead
of GradientDescent as the optimization algorithm, it would be helpful if you
could post it.

thanks 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-choosing-the-Loss-function-tp11738p11913.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Re: [MLLib]:choosing the Loss function

2014-08-07 Thread Burak Yavuz
The following code will allow you to run Logistic Regression using L-BFGS:

val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
lbfgs.setMaxNumIterations(numIterations).setRegParam(regParam).setConvergenceTol(tol).setNumCorrections(numCor)

val weights = lbfgs.optimize(data, initialWeights)

The different loss function support you are asking for is the `new 
LogisticGradient()` part. The different regularizations support
is the `new SquaredL2Updater()`

The supported loss functions are:
1) Logistic - LogisticGradient
2) LeastSquares - LeastSquaresGradient
3) Hinge - HingeGradient

The regularizers are:
0) No regularization - SimpleUpdater
1) L1 regularization - L1Updater
2) L2 regularization - SquaredL2Updater
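
For example, hinge loss with L1 regularization (an L1-regularized linear SVM) follows the same pattern as the snippet above (a sketch; reuse whatever hyperparameters you already have):

val svmStyle = new LBFGS(new HingeGradient(), new L1Updater())
svmStyle.setMaxNumIterations(numIterations).setRegParam(regParam).setConvergenceTol(tol)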

You can find more here: 
http://spark.apache.org/docs/latest/mllib-linear-methods.html#loss-functions

I would suggest using L-BFGS rather than SGD as it's both much faster and more 
accurate.

Burak

- Original Message -
From: "SK" 
To: u...@spark.incubator.apache.org
Sent: Thursday, August 7, 2014 6:31:14 PM
Subject: [MLLib]:choosing the Loss function

Hi,

According to the MLLib guide, there seems to be support for different loss
functions. But I could not find a command line parameter to choose the loss
function but only found regType to choose the regularization. Does MLLib
support a parameter to choose  the loss function?

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-choosing-the-Loss-function-tp11738.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Re: questions about MLLib recommendation models

2014-08-07 Thread Burak Yavuz
Hi Jay,

I've had the same problem you've been having in Question 1 with a synthetic 
dataset. I thought I wasn't producing the dataset well enough. This seems to
be a bug. I will open a JIRA for it.

Instead of using:

ratings.map{ case Rating(u,m,r) => {
val pred = model.predict(u, m)
(r - pred)*(r - pred)
  }
}.mean()

you can use something like:

val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
val predictionsAndRatings: RDD[(Double, Double)] = predictions.map{ x =>
  def mapPredictedRating(r: Double) = if (implicitPrefs) math.max(math.min(r, 
1.0), 0.0) else r
  ((x.user, x.product), mapPredictedRating(x.rating))
}.join(data.map(x => ((x.user, x.product), x.rating))).values

math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).mean())

This work around worked for me.

Regarding your question 2, it is best if you filter the dataset so that every user and 
product you predict for also appears in the training set.
If we don't have any training data for a user, there is no way to predict how they 
would like a product.
That filtering takes a bit of work though. I can share some code on that too if 
you like.
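
For instance, a rough sketch of that filtering (the names are placeholders; `training` and `test` are assumed to be RDD[Rating]):

val trainUsers = training.map(r => (r.user, ())).distinct()
val trainProducts = training.map(r => (r.product, ())).distinct()

// keep only test ratings whose user AND product both appear in the training set
val predictable = test.map(r => (r.user, r)).join(trainUsers).map(_._2._1)
  .map(r => (r.product, r)).join(trainProducts).map(_._2._1)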

Best,
Burak

- Original Message -
From: "Jay Hutfles" 
To: user@spark.apache.org
Sent: Thursday, August 7, 2014 1:06:33 PM
Subject: questions about MLLib recommendation models

I have a few questions regarding a collaborative filtering model, and was
hoping for some recommendations (no pun intended...)

*Setup*

I have a csv file with user/movie/ratings named unimaginatively
'movies.csv'.  Here are the contents:

0,0,5
0,1,5
0,2,0
0,3,0
1,0,5
1,3,0
2,1,4
2,2,0
3,0,0
3,1,0
3,2,5
3,3,4
4,0,0
4,1,0
4,2,5

I then load it into an RDD with a nice command like

val ratings = sc.textFile("movies.csv").map(_.split(',') match { case
Array(u,m,r) => (Rating(u.toInt, m.toInt, r.toDouble))})

So far so good.  I'm even okay building a model for predicting the absent
values in the matrix with

val rank = 10
val iters = 20
val model = ALS.train(ratings, rank, iters)

I can then use the model to predict any user/movie rating without trouble,
like

model.predict(2, 0)

*Question 1: *

If I were to calculate, say, the mean squared error of the training set (or
to my next question, a test set), this doesn't work:

ratings.map{ case Rating(u,m,r) => {
val pred = model.predict(u, m)
(r - pred)*(r - pred)
  }
}.mean()

Actually, any action on RDDs created by mapping over the RDD[Rating] with a
model prediction  fails, like

ratings.map{ case Rating(u, m, _) => model.predict(u, m) }.collect

I get errors due to a "scala.MatchError: null".  Here's the exact verbiage:


org.apache.spark.SparkException: Job aborted due to stage failure: Task
26150.0:1 failed 1 times, most recent failure: Exception failure in TID
7091 on host localhost: scala.MatchError: null

org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571)

org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:43)
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:18)
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:18)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)

scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)

org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)

org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)

I think I'm missing something, since I can build up a scala collection of
the exact (user, movie) tuples I'm testing, map over that with the model
prediction, and it works fine.  But if I map over the RDD[Rating], it
doesn't.  Am I doing something obviously wrong?

*Question 2:*

I have a much

Re: KMeans Input Format

2014-08-07 Thread Burak Yavuz
Hi,

Could you try running spark-shell with the flag --driver-memory 2g or more if 
you have more RAM available and try again?

Thanks,
Burak

- Original Message -
From: "AlexanderRiggers" 
To: u...@spark.incubator.apache.org
Sent: Thursday, August 7, 2014 7:37:40 AM
Subject: KMeans Input Format

I want to perform a K-Means task, but training the model fails and I get kicked
out of Spark's Scala shell before I get my result metrics. I am not sure if
the input format is the problem or something else. I use Spark 1.0.0 and my
input text file (400MB) looks like this:

86252 3711 15.4 4.18 86252 3504 28 1.25 86252 3703 10.75 8.85 86252 3703
10.5 5.55 86252 2201 64 2.79 12262064 7203 32 8.49 12262064 2119 32 1.99
12262064 3405 8.5 2.99 12262064 2119 23 0 12262064 2119 33.8 1.5 12262064
3611 23.7 1.95 etc.

The columns are ID, Category, ProductSize, PurchaseAmount. I am not sure if I can use
the first two, because the MLlib example file only uses floats. So I
also tried the last two:

16 2.49 64 3.29 56 1 16 3.29 6 4.99 10.75 0.79 4.6 3.99 11 1.18 5.8 1.25 15
0.99

My error code in both cases is here:

scala> import org.apache.spark.mllib.clustering.KMeans import
org.apache.spark.mllib.clustering.KMeans

scala> import org.apache.spark.mllib.linalg.Vectors import
org.apache.spark.mllib.linalg.Vectors

scala>

scala> // Load and parse the data

scala> val data = sc.textFile("data/outkmeanssm.txt") 14/08/07 16:15:37 INFO
MemoryStore: ensureFreeSpace(35456) called with curMem=0, maxMem=318111744
14/08/07 16:15:37 INFO MemoryStore: Block broadcast_0 stored as values to
memory (estimated size 34.6 KB, free 303.3 MB) data:
org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at :14

scala> val parsedData = data.map(s => Vectors.dense(s.split('
').map(_.toDouble))) parsedData:
org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] =
MappedRDD[2] at map at :16

scala>

scala> // Cluster the data into two classes using KMeans

scala> val numClusters = 2 numClusters: Int = 2

scala> val numIterations = 20 numIterations: Int = 20

scala> val clusters = KMeans.train(parsedData, numClusters, numIterations)
14/08/07 16:15:38 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/08/07 16:15:38 WARN LoadSnappy: Snappy native library not loaded 14/08/07
16:15:38 INFO FileInputFormat: Total input paths to process : 1 14/08/07
16:15:38 INFO SparkContext: Starting job: takeSample at KMeans.scala:260
14/08/07 16:15:38 INFO DAGScheduler: Got job 0 (takeSample at
KMeans.scala:260) with 7 output partitions (allowLocal=false) 14/08/07
16:15:38 INFO DAGScheduler: Final stage: Stage 0(takeSample at
KMeans.scala:260) 14/08/07 16:15:38 INFO DAGScheduler: Parents of final
stage: List() 14/08/07 16:15:38 INFO DAGScheduler: Missing parents: List()
14/08/07 16:15:38 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[6] at map
at KMeans.scala:123), which has no missing parents 14/08/07 16:15:39 INFO
DAGScheduler: Submitting 7 missing tasks from Stage 0 (MappedRDD[6] at map
at KMeans.scala:123) 14/08/07 16:15:39 INFO TaskSchedulerImpl: Adding task
set 0.0 with 7 tasks 14/08/07 16:15:39 INFO TaskSetManager: Starting task
0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL) 14/08/07
16:15:39 INFO TaskSetManager: Serialized task 0.0:0 as 2221 bytes in 3 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on
executor localhost: localhost (PROCESS_LOCAL) 14/08/07 16:15:39 INFO
TaskSetManager: Serialized task 0.0:1 as 2221 bytes in 0 ms 14/08/07
16:15:39 INFO TaskSetManager: Starting task 0.0:2 as TID 2 on executor
localhost: localhost (PROCESS_LOCAL) 14/08/07 16:15:39 INFO TaskSetManager:
Serialized task 0.0:2 as 2221 bytes in 0 ms 14/08/07 16:15:39 INFO
TaskSetManager: Starting task 0.0:3 as TID 3 on executor localhost:
localhost (PROCESS_LOCAL) 14/08/07 16:15:39 INFO TaskSetManager: Serialized
task 0.0:3 as 2221 bytes in 1 ms 14/08/07 16:15:39 INFO TaskSetManager:
Starting task 0.0:4 as TID 4 on executor localhost: localhost
(PROCESS_LOCAL) 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:4
as 2221 bytes in 0 ms 14/08/07 16:15:39 INFO TaskSetManager: Starting task
0.0:5 as TID 5 on executor localhost: localhost (PROCESS_LOCAL) 14/08/07
16:15:39 INFO TaskSetManager: Serialized task 0.0:5 as 2221 bytes in 0 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on
executor localhost: localhost (PROCESS_LOCAL) 14/08/07 16:15:39 INFO
TaskSetManager: Serialized task 0.0:6 as 2221 bytes in 0 ms 14/08/07
16:15:39 INFO Executor: Running task ID 4 14/08/07 16:15:39 INFO Executor:
Running task ID 1 14/08/07 16:15:39 INFO Executor: Running task ID 5
14/08/07 16:15:39 INFO Executor: Running task ID 6 14/08/07 16:15:39 INFO
Executor: Running task ID 0 14/08/07 16:15:39 INFO Executor: Running task ID
3 14/08/07 16:15:39 INFO Executor: Running task ID 2 14/08/07 16:15:39 INFO
BlockManager: Found block broadcast_0 locally 14/

Re: Regularization parameters

2014-08-06 Thread Burak Yavuz
Hi,

That is interesting. Would you please share some code on how you are setting 
the regularization type, regularization parameters and running Logistic 
Regression?

Thanks,
Burak

- Original Message -
From: "SK" 
To: u...@spark.incubator.apache.org
Sent: Wednesday, August 6, 2014 6:18:43 PM
Subject: Regularization parameters

Hi,

I tried different regularization parameter values with Logistic Regression
for binary classification of my dataset and would like to understand the
following results:

regType = L2, regParam = 0.0 , I am getting AUC = 0.80 and accuracy of 80% 
regType = L1, regParam = 0.0 , I am getting AUC = 0.80 and accuracy of 50%

To calculate accuracy I am using 0.5 as the threshold: a prediction < 0.5 is class
0, and a prediction >= 0.5 is class 1.

regParam = 0.0 implies I am not using any regularization, is that correct?
If so, it should not matter whether I specify L1 or L2; I should get the
same results. So why is the accuracy value different? 

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Regularization-parameters-tp11601.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org






Re: Naive Bayes parameters

2014-08-06 Thread Burak Yavuz
Hi,

Could you please send the link for the example you are talking about? 
minPartitions and numFeatures do not exist in the current API 
for NaiveBayes as far as I know. So, I don't know how to answer your second 
question.

Regarding your first question, guessing blindly, it should be related to
numPartitions, which is the number of partitions your dataset consists of.
It is usually best to set this number to the number of cores your machine has;
you can also try twice the number of cores, or half.
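
To illustrate the guess, a minimal spark-shell sketch (not the example SK is
referring to; the input file path and its comma/space format are assumptions):
a minPartitions-style argument usually belongs to the data-loading step rather
than to NaiveBayes itself, and the partition count is what you would tune to
the number of cores:

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// second argument of textFile is minPartitions; defaultParallelism ~ number of cores
val numPartitions = sc.defaultParallelism
val data = sc.textFile("data/sample_naive_bayes_data.txt", numPartitions).map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
val model = NaiveBayes.train(data, lambda = 1.0)

Here the number of features is simply the length of each dense vector, so nothing
of that sort needs to be passed to NaiveBayes.train.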

Best,
Burak

- Original Message -
From: "SK" 
To: u...@spark.incubator.apache.org
Sent: Wednesday, August 6, 2014 3:45:09 PM
Subject: Naive Bayes parameters


1) How is the minPartitions parameter in the NaiveBayes example used? What is
the default value?

2) Why is numFeatures specified as a parameter? Can it not be
obtained from the data? This parameter is not specified for the other MLlib
algorithms.

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Naive-Bayes-parameters-tp11592.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Retrieve dataset of Big Data Benchmark

2014-07-16 Thread Burak Yavuz
Hi Tom,

Actually I was mistaken, sorry about that. Indeed, the keys for the datasets
you mention are not showing up on the website; however, they are still
accessible through the spark-shell, which means that they are there.

So in order to answer your questions:
- Are the tiny and 1node sets still available? 

Yes, they are.

- Are the Uservisits and Rankings still available?

Yes, they are.

- Why is the crawl set bigger than expected, and how big is it?

It says on the website that it is ~30 GB per node. Since you're downloading the 
5nodes version, the total size should be 150 GB.

As for other ways you can download them:

I think using the spark-shell would be easiest (at least it was for me :).

Once you start the spark-shell, you can access the files as follows (this is the
tiny crawl dataset; exchange tiny with 1node or 5nodes, and crawl with uservisits
or rankings, as desired. Mind the lowercase):

val dataset = sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")

dataset.saveAsTextFile("your/local/relative/path/here")

The file will be saved relative to where you run the spark-shell from.
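
For instance, a sketch (assuming the same bucket layout and that every size and
dataset combination is still hosted; the local output path is arbitrary) that
pulls all of the text datasets in one pass:

for (size <- Seq("tiny", "1node", "5nodes");
     name <- Seq("uservisits", "rankings", "crawl")) {
  val s3Path = s"s3n://big-data-benchmark/pavlo/text/$size/$name"
  // saves relative to where the spark-shell was started, as above
  sc.textFile(s3Path).saveAsTextFile(s"benchmark/$size/$name")
}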

Hope this helps!
Burak


- Original Message -
From: "Tom" 
To: u...@spark.incubator.apache.org
Sent: Wednesday, July 16, 2014 9:10:58 AM
Subject: Re: Retrieve dataset of Big Data Benchmark

Hi Burak,

Thank you for your pointer, it is really helping out. I do have some
consecutive questions though.

After looking at the Big Data Benchmark page (section "Run this benchmark
yourself"), I was expecting the following combinations of files:
Sets: Uservisits, Rankings, Crawl
Sizes: tiny, 1node, 5nodes
Both in text and sequence-file format.

When looking at http://s3.amazonaws.com/big-data-benchmark/, I only see:
sequence-snappy/5nodes/_distcp_logs_44js2v part 0 to 103
sequence-snappy/5nodes/_distcp_logs_nclxhd part 0 to 102
sequence-snappy/5nodes/_distcp_logs_vnuhym part 0 to 24
sequence-snappy/5nodes/crawl part 0 to 743

As "Crawl" is the name of a set I am looking for, I started to download it.
Since it was the end of the day and I was going to download it overnight, I
just wrote a for loop from 0 to 999 with wget, expecting it to download
until 7-something and 404 errors for the others. When I looked at it this
morning, I noticed that it all completed downloading. The total Crawl set
for 5 nodes should be ~30Gb, I am currently at part 1020 with a total set of
40G. 

This leads to my (sub)questions:
Does anybody know what exactly is still hosted:
- Are the tiny and 1node sets still available? 
- Are the Uservisits and Rankings still available?
- Why is the crawl set bigger than expected, and how big is it?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Retrieve-dataset-of-Big-Data-Benchmark-tp9821p9938.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Retrieve dataset of Big Data Benchmark

2014-07-15 Thread Burak Yavuz
Hi Tom,

If you wish to load the file in Spark directly, you can use 
sc.textFile("s3n://big-data-benchmark/pavlo/...") where sc is your 
SparkContext. This can be
done because the files should be publicly available and you don't need AWS 
Credentials to access them.

If you want to download the file on your local drive: you can use the link 
http://s3.amazonaws.com/big-data-benchmark/pavlo/...

One note though: the tiny dataset doesn't seem to exist anymore. You can look at
http://s3.amazonaws.com/big-data-benchmark/
to see the available files; a Ctrl+F for "tiny" returned 0 matches.


Best,
Burak

- Original Message -
From: "Tom" 
To: u...@spark.incubator.apache.org
Sent: Tuesday, July 15, 2014 2:10:15 PM
Subject: Retrieve dataset of Big Data Benchmark

Hi,

I would like to use the dataset used in the Big Data Benchmark on my own
cluster, to run some
tests between Hadoop and Spark. The dataset should be available at
s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix],
in the amazon cluster. Is there a way I can download this without being a
user of the Amazon cluster? I tried 
"bin/hadoop distcp s3n://123:456@big-data-benchmark/pavlo/text/tiny/* ./"
but it asks for an AWS Access Key ID and Secret Access Key which I do not
have. 

Thanks in advance,

Tom



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Retrieve-dataset-of-Big-Data-Benchmark-tp9821.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Restarting a Streaming Context

2014-07-09 Thread Burak Yavuz
Someone can correct me if I'm wrong, but unfortunately for now, once a 
streaming context is stopped, it can't be restarted.  
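
A common workaround, as a sketch (not tested against that exact shell session):
keep the SparkContext alive, as Nick does with stopSparkContext = false, and
build a brand-new StreamingContext on top of it, re-declaring the DStreams
before calling start():

import org.apache.spark.streaming.{Seconds, StreamingContext}

// the stopped ssc cannot be started again, so create a fresh one from the live SparkContext
val ssc2 = new StreamingContext(sc, Seconds(1))   // batch interval is an arbitrary choice here
// ...re-create the DStreams and output operations from the original setup...
ssc2.start()
// and later, as before:
// ssc2.stop(stopSparkContext = false, stopGracefully = true)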

- Original Message -
From: "Nick Chammas" 
To: u...@spark.incubator.apache.org
Sent: Wednesday, July 9, 2014 6:11:51 PM
Subject: Restarting a Streaming Context

So I do this from the Spark shell:

// set things up

ssc.start()
// let things happen for a few minutes

ssc.stop(stopSparkContext = false, stopGracefully = true)

Then I want to restart the Streaming Context:

ssc.start() // still in the shell; Spark Context is still alive

Which yields:

org.apache.spark.SparkException: StreamingContext has already been stopped

How come? Is there any way in the interactive shell to restart a Streaming
Context once it is stopped?

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Restarting-a-Streaming-Context-tp9256.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


