Hi, everyone. I've come across a problem with data locality. I found this example code in "Spark-on-YARN-A-Deep-Dive-Sandy-Ryza.pdf":

val locData = InputFormatInfo.computePreferredLocations(
  Seq(new InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt"))))
Hi, everyone. I've come across a problem with changing the partition number of an RDD. My code is as below:

val rdd1 = sc.textFile(path1)
val rdd2 = sc.textFile(path2)
val rdd3 = sc.textFile(path3)
val imeiList = parseParam(job.jobParams)
val broadcastVar =
Thank you for your reply,
Actually, we have already used this parameter. Our cluster is a
standalone cluster with 16 nodes, and every node has 16 cores. We have 256 pairs
of matrices along with 256 tasks. When we set --total-executor-cores to 64,
each node can launch 4 tasks simultaneously, each
Hi
The first conf is used by Hadoop to determine the locality distribution of the HDFS
file. The second conf is used by Spark. Though they have the same name, they
are actually two different classes.
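A minimal sketch of the distinction (the file name and app name are assumptions), using the Spark 1.x constructor that accepts locality hints:

```scala
// Hypothetical sketch: "conf" in the snippet being discussed is a Hadoop
// Configuration, not a SparkConf, even though both are commonly named "conf".
import org.apache.hadoop.conf.Configuration // Hadoop side: HDFS locality lookup
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.InputFormatInfo

val hadoopConf = new Configuration() // used to resolve HDFS block locations
val locData = InputFormatInfo.computePreferredLocations(
  Seq(new InputFormatInfo(hadoopConf, classOf[TextInputFormat], new Path("myfile.txt"))))

val sparkConf = new SparkConf().setAppName("locality-demo") // Spark's own config class
val sc = new SparkContext(sparkConf, locData) // pass locality hints to the scheduler
```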
Thanks
Jerry
From: qinwei [mailto:wei@dewmobile.net]
Sent: Sunday, September 28, 2014 2:05 PM
To: user
If increasing executors really isn't enough, then you can consider using
mapPartitions to process whole partitions at a time. Within that you can
multi thread your processing of the elements in the partition. (And you
should probably use more like one worker per machine then.)
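A minimal sketch of that approach (rdd and expensiveWork are hypothetical names), using a fixed thread pool inside each partition:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

val processed = rdd.mapPartitions { iter =>
  val pool = Executors.newFixedThreadPool(8) // threads per partition; tune this
  implicit val ec = ExecutionContext.fromExecutorService(pool)
  // Fan the partition's elements out to the pool, then gather results in order.
  val futures = iter.map(x => Future(expensiveWork(x))).toList
  val results = futures.map(f => Await.result(f, Duration.Inf))
  pool.shutdown()
  results.iterator
}
```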
The question is how
Thank you for your reply,
I understand your explanation, but I wonder what the correct usage is of the API

new SparkContext(config: SparkConf, preferredNodeLocationData: Map[String, Set[SplitInfo]])

How should I construct the second param, preferredNodeLocationData? Hope for
(Most of this code is not relevant to the question and can be refactored
too. The casts and null checks look unnecessary.)
You are unioning RDDs, so the result has the sum of their partitions. The
number of partitions you pass is really only a hint to Hadoop, so it is not
necessarily even 3 x 1920.
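A small sketch of the point above (the paths and the 1920 hint are placeholders):

```scala
// minPartitions is only a hint; each textFile may end up with a different count.
val rdd1 = sc.textFile("path1", 1920)
val rdd2 = sc.textFile("path2", 1920)
val rdd3 = sc.textFile("path3", 1920)
val unioned = rdd1 union rdd2 union rdd3
// The union's partition count is the sum of its inputs' counts,
// which is not necessarily 3 * 1920.
println(unioned.partitions.length)
```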
Hi guys,
I'm a fresh Spark user...
I'm trying to set up a Spark cluster with multiple nodes, starting with 2.
With one node, it works fine. When I add a slave node, the slave is able
to register with the master node. However, when I launch a spark shell, and
when the executor is launched on the
Thank you for your reply. Your tips on code refactoring are helpful; after a
second look at the code, the casts and null checks are indeed unnecessary.
qinwei
From: Sean Owen
Date: 2014-09-28 15:03
To: qinwei
CC: user
Subject: Re: problem with partitioning

(Most of this code is not relevant
BTW, I'm using standalone deployment. (The name "standalone deployment" for a
cluster is kind of misleading... I think the doc needs to be updated.
It's not really standalone, but a plain Spark-only deployment.)
Thx,
cody
On Sun, Sep 28, 2014 at 12:36 AM, codeoedoc codeoe...@gmail.com wrote:
Hi
I cannot find it in the documentation. And I have a dozen dimension tables
to (left) join...
Cheers,
--
Jianshi Huang
LinkedIn: jianshi
Twitter: @jshuang
Github Blog: http://huangjs.github.com/
Hi
If you want IDEA to compile your Spark project (version 1.0.0 and above), you
should follow these steps:
1. Clone the Spark project.
2. Use mvn to compile the Spark project (because you need the generated avro
source files in the flume-sink module).
3. Open spark/pom.xml with IDEA.
4. Check profiles.
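Steps 1 and 2 might look like this on the command line (the flags are a sketch, not the only valid invocation):

```shell
git clone https://github.com/apache/spark.git
cd spark
# Generates the avro sources that the flume-sink module needs,
# so IDEA can resolve them afterwards.
mvn -DskipTests compile
```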
Hi
We have used LogisticRegression with two different optimization methods, SGD
and LBFGS, in MLlib.
With the same dataset and the same training and test split, we get
different weight vectors.
For example, we use
spark-1.1.0/data/mllib/sample_binary_classification_data.txt
as our training and
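The comparison described above might be set up like this (Spark 1.1 MLlib API; the split ratios, seed, and iteration count are assumptions):

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionWithSGD}
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc,
  "data/mllib/sample_binary_classification_data.txt")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 11L)

val sgdModel   = LogisticRegressionWithSGD.train(train, 100) // 100 SGD iterations
val lbfgsModel = new LogisticRegressionWithLBFGS().run(train)
// The two optimizers take different paths to (possibly different) optima,
// so the learned weight vectors generally differ:
println(sgdModel.weights)
println(lbfgsModel.weights)
```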
All
Sorry, this is only loosely Spark-related, but I thought some of you in San
Francisco might be interested in this talk. We announced it recently; it will
be at the end of next month (Oct).
http://www.meetup.com/sfmachinelearning/events/208078582/
Prof. C.J. Lin is famous for his work on LIBSVM.
Have you looked at SPARK-1800 ?
e.g. see sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
Cheers
On Sun, Sep 28, 2014 at 1:55 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
I cannot find it in the documentation. And I have a dozen dimension tables
to (left) join...
Cheers,
Yes, it looks like it can only be controlled by the
parameter spark.sql.autoBroadcastJoinThreshold, which is a little bit weird
to me.
How am I supposed to know the exact size of a table in bytes? Letting me
specify the join algorithm would be preferable, I think.
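For reference, the parameter can be set like this (the SQL and table names are hypothetical); tables smaller than the threshold are broadcast to every executor instead of shuffled:

```scala
// Raise the threshold to 10 MB (the value is just an example).
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=10485760")
// An equi-join whose smaller side is under the threshold may now
// use a broadcast join.
val joined = sqlContext.sql(
  "SELECT f.id, d.name FROM fact f JOIN small_dim d ON f.dim_id = d.id")
```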
Jianshi
On Sun, Sep 28, 2014 at 11:57 PM, Ted Yu
It turned out to be a bug in my code. In the select clause, the list of fields
was misaligned with the schema of the target table. As a consequence, the map
data couldn't be cast to some other type in the schema.
Thanks anyway.
On 9/26/14, 8:08 PM, Cheng Lian lian.cs@gmail.com wrote:
Would you mind
Hi All,
I am interested in calling collect() on a large RDD so that I can run a learning
algorithm on it. I've noticed that when I don't increase
SPARK_DRIVER_MEMORY I can run out of memory. I've also noticed that it
looks like the same fraction of memory is reserved for storage on the
driver as on the
Can anybody confirm whether or not views are currently supported in Spark? I
found "create view translate" in the blacklist of HiveCompatibilitySuite.scala,
and the following scenario also threw a NullPointerException on
beeline/thriftserver (1.1.0). Any plan to support them soon?
create table
Views are not supported yet. It's not currently on the near-term roadmap,
but that can change if there is sufficient demand or someone in the
community is interested in implementing them. I do not think it would be
very hard.
Michael
On Sun, Sep 28, 2014 at 11:59 AM, Du Li
Thanks, Michael, for your quick response.
Views are critical for my project, which is migrating from Shark to Spark SQL. I
have implemented and tested everything else. It would be perfect if views could
be implemented soon.
Du
From: Michael Armbrust
Hi Spark users and developers,
Some of the most active Spark developers (including Matei Zaharia, Michael
Armbrust, Joseph Bradley, TD, Paco Nathan, and me) will be in NYC for
Strata NYC. We are working with the Spark NYC meetup group and Bloomberg to
host a meetup event. This might be the event
The storage fraction only limits the amount of memory used for storage; it
doesn't actually limit anything else. I.e., collect can still use all of the
driver's memory if it needs to.
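So when collecting a large RDD, the driver heap itself has to be raised, for example (the 8g value is just an illustration):

```shell
# Equivalent to exporting SPARK_DRIVER_MEMORY in spark-env.sh:
spark-submit --driver-memory 8g my-app.jar
```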
On Sunday, September 28, 2014, Brad Miller bmill...@eecs.berkeley.edu
wrote:
Hi All,
I am interested to collect() a large RDD
Thanks for the response. From the Spark Web UI's Storage tab, I do see the cached
RDD there.
But the storage level is "Memory Deserialized 1x Replicated". How can I change
the storage level? Because I have a big table there.
Thanks!
From: Cheng Lian
This is not possible until https://github.com/apache/spark/pull/2501 is
merged.
On Sun, Sep 28, 2014 at 6:39 PM, Haopu Wang hw...@qilinsoft.com wrote:
Thanks for the response. From Spark Web-UI's Storage tab, I do see
cached RDD there.
But the storage level is Memory Deserialized 1x
You might consider instead storing the data using saveAsParquetFile and
then querying that after running
sqlContext.parquetFile(...).registerTempTable(...).
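A sketch of that suggestion (the path, table name, and schemaRdd are hypothetical):

```scala
// Persist the data as Parquet once...
schemaRdd.saveAsParquetFile("hdfs:///tmp/my_table.parquet")
// ...then load and query the Parquet file afterwards.
sqlContext.parquetFile("hdfs:///tmp/my_table.parquet").registerTempTable("my_table")
val counts = sqlContext.sql("SELECT COUNT(*) FROM my_table")
```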
On Sun, Sep 28, 2014 at 6:43 PM, Michael Armbrust mich...@databricks.com
wrote:
This is not possible until
Figured this out... documented here and hope can help others:
http://koobehub.wordpress.com/2014/09/29/spark-the-standalone-cluster-deployment/
On Sun, Sep 28, 2014 at 12:36 AM, codeoedoc codeoe...@gmail.com wrote:
Hi guys,
This is a spark fresh user...
I'm trying to setup a spark cluster
Chris,
I thought I would check back with you to see if you have made progress on this
issue. Any good news so far? Thanks. Once again, I really appreciate your
looking into this issue.
Thanks,
Wei
On Thu, Aug 28, 2014 at 4:44 PM, Chris Fregly ch...@fregly.com wrote:
great question, wei. this is very
Thanks Cheng.
For the time being, as a workaround, I applied the schema
to Queryresult1 and then registered the result as a temp table. Although
that works, I was not sure of the performance impact, as it might block
some optimizations in some scenarios.
This flow (on spark 1.1 ) works:
This workaround looks good to me. In this way, all queries are still
executed lazily within a single DAG, and Spark SQL is able to
optimize the query plan as a whole.
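The workaround being discussed might look roughly like this (Spark 1.1 API; the queries and table names are hypothetical):

```scala
val queryResult1 = sqlContext.sql("SELECT key, value FROM src") // intermediate result
// Re-apply the known schema and register the result as a temp table...
val withSchema = sqlContext.applySchema(queryResult1, queryResult1.schema)
withSchema.registerTempTable("intermediate")
// ...later queries against it still compose into one lazy plan.
val finalResult = sqlContext.sql(
  "SELECT key, COUNT(*) FROM intermediate GROUP BY key")
```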
On 9/29/14 11:26 AM, twinkle sachdeva wrote:
Thanks Cheng.
For the time being , As a work around, I had applied the schema