This is something we (at Typesafe) also thought about, but didn't start
yet. It would be good to pool efforts.
On Sat, Jun 27, 2015 at 12:44 AM, Dave Ariens dari...@blackberry.com
wrote:
Fair. I will look into an alternative with a generated delegation token.
However, the same issue exists.
/usr/bin/ looks like a strange directory. Did you copy some files to
/usr/bin yourself?
If you download (or possibly compile) Spark, it will never be placed into
/usr/bin
On Sun, Jun 28, 2015 at 9:19 AM, Wojciech Pituła w.pit...@gmail.com wrote:
I assume that /usr/bin/load-spark-env.sh exists. Have you got the rights to
execute it?
On Sun, 28 Jun 2015 at 04:53, Ashish Soni asoni.le...@gmail.com
wrote:
Not sure what the issue is, but when I run spark-submit or spark-shell
I get the error below:
/usr/bin/spark-class:
Hi,
I'm a junior Spark user from China.
I have a problem with submitting a Spark job right now. I want to submit the job
from code.
In other words, how do I submit a Spark job from within a Java program to a YARN
cluster without using spark-submit?
I've learnt from the official site
Ayman - it's really a question of recommending users to products vs. products
to users. There will only be a difference if you're not doing all-to-all.
For example, if you're recommending only the top N recommendations: then
you may recommend only the top N products or the top N users, which would be
On 27 Jun 2015, at 07:56, Tim Chen
t...@mesosphere.io wrote:
Does YARN provide the token through that env variable you mentioned? Or how
does YARN do this?
Roughly:
1. the client-side launcher creates the delegation tokens and adds them as byte[]
data to the
Can you show us your code around line 100?
Which Spark release are you compiling against?
Cheers
On Sun, Jun 28, 2015 at 5:49 AM, Arthur Chan arthur.hk.c...@gmail.com
wrote:
Hi,
I am trying Spark with some sample programs.
In my code, the following items are imported:
import
Oops - the code should be:
val a = rdd.zipWithIndex().filter(s => 1 < s._2 && s._2 < 100)
On Sun, Jun 28, 2015 at 8:24 AM Ilya Ganelin ilgan...@gmail.com wrote:
You can also select pieces of your RDD by first doing a zipWithIndex and
then doing a filter operation on the second element of the RDD.
For
I've heard "Spark is not just MapReduce" mentioned during Spark talks, but it
seems like every method that Spark has is really doing something like (Map
-> Reduce) or (Map -> Map -> Map -> Reduce) etc. behind the scenes, with the
performance benefit of keeping RDDs in memory between stages.
Am I wrong
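The in-memory benefit the question mentions is easiest to see once one RDD feeds two actions. A minimal, hedged sketch (the local master and input path are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("cache-demo"))
    // An RDD that is expensive to recompute (read + parse).
    val parsed = sc.textFile("/path/to/data.txt")
      .map(_.split(","))
      .persist(StorageLevel.MEMORY_ONLY)
    // Both actions reuse the cached partitions instead of re-reading and
    // re-parsing the file; chained MapReduce jobs would hit disk in between.
    val rows = parsed.count()
    val cols = parsed.first().length
    sc.stop()
  }
}
```

Without the persist() call, the lineage would be recomputed for each action, which is exactly the repeated (Map -> Reduce) behaviour the question describes.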
You can also select pieces of your RDD by first doing a zipWithIndex and
then doing a filter operation on the second element of the RDD.
For example, to select the first 100 elements:
val a = rdd.zipWithIndex().filter(s => 1 < s._2 && s._2 < 100)
On Sat, Jun 27, 2015 at 11:04 AM Ayman Farahat
Thanks Ilya
Is there an advantage to, say, partitioning by users/products when you train?
Here are two alternatives I have
# Partition by user or product
tot = newrdd.map(lambda l:
    (l[1], Rating(int(l[1]), int(l[2]), l[4]))).partitionBy(50).cache()
ratings = tot.values()
model = ALS.train(ratings,
Hi,
I am trying Spark with some sample programs,
In my code, the following items are imported:
import org.apache.spark.mllib.regression.{StreamingLinearRegressionWithSGD,
LabeledPoint}
import org.apache.spark.mllib.regression.{StreamingLinearRegressionWithSGD}
import
Running this now
./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0
-Phive -Phive-thriftserver -DskipTests clean package
Waiting for it to complete. There is no progress after initial log messages
//LOGS
$ ./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn
Any thoughts on this ?
On Fri, Jun 26, 2015 at 2:27 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
It used to work with 1.3.1; however, with 1.4.0 I get the following
exception
export SPARK_HOME=/home/dvasthimal/spark1.4/spark-1.4.0-bin-hadoop2.4
export
The Maven command needs to be passed through the --mvn option.
Cheers
On Sun, Jun 28, 2015 at 12:56 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Running this now
./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0
-Phive -Phive-thriftserver -DskipTests clean package
Waiting
./make-distribution.sh --tgz --mvn -Phadoop-2.4 -Pyarn
-Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
or
./make-distribution.sh --tgz --mvn -Phadoop-2.4 -Pyarn
-Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
Both fail with
+
you need 1) to publish to an in-house Maven repository, so your application can
depend on your version, and 2) to use the Spark distribution you compiled to
launch your job (assuming you run on YARN, so you can launch multiple versions
of Spark on the same cluster)
On Sun, Jun 28, 2015 at 4:33 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)
I am able to use the blockJoin API and it does not throw a compilation error:
val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary, Long))]
= lstgItem.blockJoin(viEvents, 1, 1).map {
}
Here viEvents is highly skewed and both are on HDFS.
What should be the optimal values of replication, i
A blockJoin spreads out one side while replicating the other. I would
suggest replicating the smaller side. So if lstgItem is smaller, try 3,1, or
else 1,3. This should spread the fat keys out over multiple (3)
executors...
On Sun, Jun 28, 2015 at 5:35 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
You mentioned that storage levels must be
(should be memory-and-disk or disk-only), and number of partitions (should be
large, a multiple of the number of executors).
How do I specify that?
On Sun, Jun 28, 2015 at 2:35 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I am able to use blockjoin API and it does not
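For what it's worth, both settings are made in code: persist() takes the storage level, and the shuffle operation itself takes the partition count. A sketch with illustrative numbers (local master, tiny data):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object JoinTuning {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[4]").setAppName("join-tuning"))
    // The storage level is chosen per RDD via persist():
    val left = sc.parallelize(1 to 1000).map(i => (i % 100, i))
      .persist(StorageLevel.MEMORY_AND_DISK)
    val right = sc.parallelize(1 to 1000).map(i => (i % 100, -i))
      .persist(StorageLevel.DISK_ONLY)
    // The partition count is passed to the shuffle operation itself;
    // 8 here stands in for "a large multiple of the number of executors".
    val joined = left.join(right, 8)
    joined.count()
    sc.stop()
  }
}
```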
I ran this w/o maven options
./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Phive
-Phive-thriftserver
I got this spark-1.4.0-bin-2.4.0.tgz in the same working directory.
I hope this is built with 2.4.x Hadoop, as I did specify -P
On Sun, Jun 28, 2015 at 1:10 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)
How can I import this pre-built Spark into my application via Maven, as I
want to use the block join API?
On Sun, Jun 28, 2015 at 1:31 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I ran this w/o maven options
./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Phive
-Phive-thriftserver
I
Spark is partitioner-aware, so it can exploit a situation where 2 datasets
are partitioned the same way (for example by doing a map-side join on
them). Map-reduce does not expose this.
On Sun, Jun 28, 2015 at 12:13 PM, YaoPau jonrgr...@gmail.com wrote:
I've heard Spark is not just MapReduce
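The partitioner-aware join described above can be sketched in a few lines; the datasets and partition count here are illustrative:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("copartition"))
    val p = new HashPartitioner(8)
    // Both datasets are hash-partitioned the same way; the left side is cached.
    val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
      .partitionBy(p).cache()
    val orders = sc.parallelize(Seq((1, 9.99), (2, 5.0)))
      .partitionBy(p)
    // join() sees the shared partitioner and becomes a narrow,
    // shuffle-free stage instead of a full repartition.
    val joined = users.join(orders)
    joined.collect()
    sc.stop()
  }
}
```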
I incremented the version of Spark from 1.4.0 to 1.4.0.1 and ran
./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Phive
-Phive-thriftserver
The build was successful but the script failed. Is there a way to pass the
incremented version?
[INFO] BUILD SUCCESS
[INFO]
Vanilla map/reduce does not expose it; but Hive on top of map/reduce has
superior partitioning (and bucketing) support compared to Spark.
2015-06-28 13:44 GMT-07:00 Koert Kuipers ko...@tresata.com:
spark is partitioner aware, so it can exploit a situation where 2 datasets
are partitioned the same way
Spark comes with quite a few components. At its core is... surprise... Spark
Core. This provides the core things required to run Spark jobs. Spark provides
a lot of operators out of the box... take a look at
Hey sparkers,
I'm trying to use Logback for logging from my Spark jobs but I noticed that
if I submit the job with spark-submit then the log4j implementation of
slf4j is loaded instead of logback. Consequently, any call to
org.slf4j.LoggerFactory.getLogger will return a log4j logger instead of a
Hi,
I want to deploy my application on a standalone cluster.
Spark submit acts in a strange way. When I deploy the application in
*client* mode, everything works well and my application can see the
additional jar files.
Here is the command:
spark-submit --master spark://1.2.3.4:7077
You can use the following command to build Spark after applying the pull
request:
mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean package
Cheers
On Sun, Jun 28, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I see that block join support did not make it into the Spark 1.4 release.
Can
A few doubts:
In 1.2 streaming, when I use a union of streams, my streaming application
sometimes hangs and nothing gets printed on the driver.
[Stage 2:
(0 + 2) / 2]
What does (0 + 2) / 2 signify here?
1. Does the number of streams in
I would also add, from a data locality theoretic standpoint, mapPartitions()
provides for node-local computation that plain old map-reduce does not.
Original message
From: Ashic Mahtab as...@live.com
Date:
I would like to get a sense of the Spark YARN clusters in use, and this
thread can help others as well:
1. Number of nodes in cluster
2. Container memory limit
3. Typical Hardware configuration of worker nodes
4. Typical number of executors used ?
5. Any other related info you want to share.
How
I just did that; where can I find that spark-1.4.0-bin-hadoop2.4.tgz
file?
On Sun, Jun 28, 2015 at 12:15 PM, Ted Yu yuzhih...@gmail.com wrote:
You can use the following command to build Spark after applying the pull
request:
mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean package
Cheers
Regarding the # of executors:
I get 342 executors in parallel each time, and I set executor-cores to 1.
Hence I need to set 342 * 2 * x (x = 1, 2, 3, ...) as the number of partitions
while running blockJoin. Is this correct?
And are my assumptions on replication levels correct?
Did you get a chance to look
Hi,
line 99:  model.trainOn(labeledStream)
line 100: model.predictOn(labeledStream).print()
line 101: ssc.start()
line 102: ssc.awaitTermination()
Regards
On Sun, Jun 28, 2015 at 10:53 PM, Ted Yu yuzhih...@gmail.com wrote:
Can you show us your code around line 100 ?
Which
You are trying to predict on a DStream[LabeledPoint] (data + labels), but
predictOn expects a DStream[Vector] (just the data, without the labels).
Try doing:
val unlabeledStream = labeledStream.map { x => x.features }
model.predictOn(unlabeledStream).print()
On Sun, Jun 28, 2015 at 6:03 PM, Arthur
Could you please suggest and help me understand further?
These are the actual sizes:
-sh-4.1$ hadoop fs -count dw_lstg_item
1 764 2041084436189
/sys/edw/dw_lstg_item/snapshot/2015/06/25/00
*This is not skewed; there is exactly one entry for each item, but it's 2 TB.*
So should
Specify numPartitions or a partitioner for operations that shuffle.
So use:
def join[W](other: RDD[(K, W)], numPartitions: Int)
or
def blockJoin[W](
    other: JavaPairRDD[K, W],
    leftReplication: Int,
    rightReplication: Int,
    partitioner: Partitioner)
for example:
left.blockJoin(right, 3, 1,
My code:
val viEvents = details.filter(_.get(14).asInstanceOf[Long] !=
NULL_VALUE).map
{ vi => (vi.get(14).asInstanceOf[Long], vi) } // AVRO (150G)
val lstgItem = DataUtil.getDwLstgItem(sc,
DateUtil.addDaysToDate(startDate, -89)).filter(_.getItemId().toLong !=
NULL_VALUE).map { lstg =>
I am unable to run my application or the sample application with prebuilt Spark
1.4 or with this custom 1.4. In both cases I get this error:
15/06/28 15:30:07 WARN ipc.Client: Exception encountered while connecting
to the server : java.lang.IllegalArgumentException: Server has invalid
Kerberos
Regarding your calculation of executors... RAM in an executor is not really
comparable to size on disk.
If you read from file and write to file, you do not have to set a storage
level.
In the join or blockJoin, specify the number of partitions as a multiple (say 2
times) of the number of cores available.
Other people might disagree, but I have had better luck with a model that
looks more like traditional map-red if you use Spark for disk-to-disk
computations: more cores per executor (and so less RAM per core/task). So I
would suggest trying --executor-cores 4 and adjusting numPartitions
accordingly.
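For reference, the equivalent of those spark-submit flags can also be set on a SparkConf in code; the memory figure here is a placeholder, not a recommendation:

```scala
import org.apache.spark.SparkConf

// Fewer, fatter executors: 4 cores each, so less RAM per core/task.
// Values are illustrative; tune them for your cluster.
val conf = new SparkConf()
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "12g")
```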
Figured it out.
All the jars that are specified with driver-class-path are now exported
through SPARK_CLASSPATH, and it's working now.
I thought SPARK_CLASSPATH was dead. Looks like it keeps flipping ON/OFF.
On Sun, Jun 28, 2015 at 12:55 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Any thoughts on
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize",
"67108864")
sc.hadoopConfiguration(hadoopConf)
or
sc.hadoopConfiguration = hadoopConf
threw an error.
On Sun, Jun 28, 2015 at 9:32 PM, Ted Yu
I can do this:
val hadoopConf = new Configuration(sc.hadoopConfiguration)
*hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize",
"67108864")*
sc.newAPIHadoopFile(
  path + "/*.avro",
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize",
"67108864")
sc.sequenceFile(getMostRecentDirectory(tablePath, _.startsWith(_)).get
  + "/*", classOf[Text], classOf[Text])
works
On Sun, Jun 28, 2015 at 9:46 PM, Ted Yu yuzhih...@gmail.com wrote:
There isn't a setter for
There isn't a setter for sc.hadoopConfiguration.
You can directly change the value of a parameter in sc.hadoopConfiguration.
However, see the note in scaladoc:
* '''Note:''' As it will be reused in all Hadoop RDDs, it's better not
to modify it unless you
* plan to set some global configurations for
sequenceFile() calls hadoopFile(), where:
val confBroadcast = broadcast(new
SerializableConfiguration(hadoopConfiguration))
You can set the parameter in sc.hadoopConfiguration before calling
sc.sequenceFile().
Cheers
On Sun, Jun 28, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
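The suggestion above, spelled out as a sketch (the path is a placeholder; the split size matches the one used earlier in this thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.Text

object SplitSizeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("split-size"))
    // Mutate the shared configuration before creating the RDD; the value is
    // captured (via the broadcast) when sequenceFile()/hadoopFile() is called.
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.split.maxsize", "67108864")
    val rdd = sc.sequenceFile("/path/to/table/*", classOf[Text], classOf[Text])
    rdd.count()
    sc.stop()
  }
}
```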
Hi Ping,
FYI, we just merged Feynman's PR:
https://github.com/apache/spark/pull/6997 that adds sequential pattern
support. Please check out master branch and help test. Thanks!
Best,
Xiangrui
On Wed, Jun 24, 2015 at 2:16 PM, Feynman Liang fli...@databricks.com wrote:
There is a JIRA for this