Out of curiosity: are there functional limitations in Spark Standalone
that are of concern? YARN is more configurable for running non-Spark
workloads and for running multiple Spark jobs in parallel. But for a single
Spark job it seems Standalone launches more quickly and does not miss any
Please do not send advertisements on this channel.
On Thu, 10 Nov 2022 at 13:40, sri hari kali charan Tummala <
kali.tumm...@gmail.com> wrote:
> Hi All,
>
> Is anyone looking for a spark scala contract role inside the USA? A
> company called Maxonic has an open spark scala contract position
I agree with Wim's assessment of data engineering / ETL vs Data Science.
I wrote pipelines/frameworks for large companies and scala was a much
better choice. But for ad-hoc work interfacing directly with data science
experiments pyspark presents less friction.
On Sat, 10 Oct 2020 at 13:03, Mich
{ println(it) }
}
}
So that shows some of the niceness of Kotlin: intuitive type conversion
`to`/`to` and `dsOf(list)` - and also the inlining of the side
effects. Overall concise and pleasant to read.
On Tue, 14 Jul 2020 at 12:18, Stephen Boesch wrote:
> I started with scala/spark in
I started with scala/spark in 2012 and scala has been my go-to language for
six years. But I heartily applaud this direction. Kotlin is more like a
simplified Scala - with the benefits that brings - than a simplified java.
I particularly like the simplified / streamlined collections classes.
Spark in local mode (which is different from standalone) is a solution for
many use cases. I use it in conjunction with (and sometimes instead of)
pandas/pandasql due to its much wider ETL related capabilities. On the JVM
side it is an even more obvious choice - given there is no equivalent to
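To make the local-mode point concrete, here is a minimal sketch (the file,
view, and column names are invented for illustration) of the kind of ad-hoc
ETL being described:

import org.apache.spark.sql.SparkSession

object LocalEtlSketch {
  def main(args: Array[String]): Unit = {
    // local[*] runs everything in a single JVM - no cluster manager needed
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("local-etl")
      .getOrCreate()

    // a typical ad-hoc step that might otherwise be done in pandas/pandasql
    val df = spark.read.option("header", "true").csv("events.csv")
    df.createOrReplaceTempView("events")
    spark.sql("select user_id, count(*) as n from events group by user_id")
      .write.mode("overwrite").parquet("events_by_user")

    spark.stop()
  }
}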
Afaik it has been there since Spark 2.0 in 2016. Not certain about Spark
1.5/1.6
On Thu, 18 Jun 2020 at 23:56, Anwar AliKhan
wrote:
> I first ran the command
> df.show()
>
> For sanity check of my dataFrame.
>
> I wasn't impressed with the display.
>
> I then ran
> df.toPandas() in Jupyter
the predicates are typically SQL expressions.
On Sat, 2 May 2020 at 06:13, Stephen Boesch wrote:
> Hi Mich!
>I think you can combine the good/rejected into one method that
> internally:
>
>- Create good/rejected df's given an input df and input
>rules/predicates to apply to the
Hi Mich!
I think you can combine the good/rejected into one method that
internally:
- Create good/rejected df's given an input df and input rules/predicates
to apply to the df.
- Create a third df containing the good rows and the rejected rows with
the bad columns nulled out
-
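A rough sketch of what such a combined method could look like - the method
name, the Column predicate, and the badCols parameter are my own framing,
not code from this thread:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.lit

// Split an input df into (good, rejected, combined) given a validity predicate.
// `badCols` are the columns to null out in the rejected rows of the combined df.
def splitGoodRejected(df: DataFrame, isValid: Column, badCols: Seq[String])
    : (DataFrame, DataFrame, DataFrame) = {
  val good     = df.filter(isValid)
  val rejected = df.filter(!isValid)

  // rejected rows with the offending columns nulled out, unioned with the good rows
  val rejectedNulled = badCols.foldLeft(rejected) { (acc, c) =>
    acc.withColumn(c, lit(null).cast(acc.schema(c).dataType))
  }
  (good, rejected, good.unionByName(rejectedNulled))
}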
The warning signs were there from the first email sent by that person. I
wonder whether there is any way to deal with this more proactively.
On Thu, 16 Apr 2020 at 10:54, Mich Talebzadeh <
mich.talebza...@gmail.com>:
> good for you. right move
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn *
>
I have been using IDEA for both scala/spark and pyspark projects since
2013. It required a fair amount of fiddling that first year but has been
stable since early 2015. For pyspark-only projects PyCharm naturally also
works very well.
On Tue, 7 Apr 2020 at 09:10, yeikel valdes wrote:
>
>
same code. Why does running them two different ways vary so much in the
> execution time.
>
>
>
>
> *Regards,Dhrubajyoti Hati.Mob No: 9886428028/9652029028*
>
>
> On Wed, Sep 11, 2019 at 8:42 AM Stephen Boesch wrote:
>
>> Sounds like you have done your homework to
Sounds like you have done your homework to properly compare. I'm
guessing the answer to the following is yes .. but in any case: are they
both running against the same spark cluster with the same configuration
parameters especially executor memory and number of workers?
On Tue, 10 Sept 2019
There are several high bars to getting a new algorithm adopted.
* It needs to be deemed by the MLLib committers/shepherds as widely useful
to the community. Algorithms offered by larger companies after having
demonstrated usefulness at scale for use cases likely to be encountered
by many
Consider the following *intended* sql:
select row_number()
over (partition by Origin order by OnTimeDepPct desc) OnTimeDepRank,*
from flights
This will *not* work in *structured streaming*: the culprit is:
partition by Origin
The requirement is to use a timestamp-typed field such as
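A hedged sketch of the kind of rewrite structured streaming pushes you
toward - aggregating over a time window on an event-time column (assumed
here to be called DepTimestamp) plus the key, since ranking window functions
over an arbitrary partition are not supported on streaming DataFrames:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, max, window}

// flights is assumed to be a streaming DataFrame with event-time column DepTimestamp
def onTimeByOriginPerHour(flights: DataFrame): DataFrame =
  flights
    .withWatermark("DepTimestamp", "1 hour")
    .groupBy(window(col("DepTimestamp"), "1 hour"), col("Origin"))
    .agg(max("OnTimeDepPct").as("maxOnTimeDepPct"))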
There are several suggestions on this SOF
https://stackoverflow.com/questions/38984775/spark-errorexpected-zero-arguments-for-construction-of-classdict-for-numpy-cor
You need to convert the final value to a python list. You implement the
function as follows:
def uniq_array(col_array):
x =
You might have better luck downloading the 2.4.X branch
On Tue, 12 Mar 2019 at 16:39, swastik mittal wrote:
> Then is the mllib of spark compatible with scala 2.12? Or can I change the
> spark version from spark3.0 to 2.3 or 2.4 in local spark/master?
>
>
>
> --
> Sent from:
I think scala 2.11 support was removed in spark 3.0/master
On Tue, 12 Mar 2019 at 16:26, swastik mittal wrote:
> I am trying to build my spark using build/sbt package, after changing the
> scala versions to 2.11 in pom.xml because my applications jar files use
> scala 2.11. But
So the LogisticRegression with regParam and elasticNetParam set to 0 is not
what you are looking for?
https://spark.apache.org/docs/2.3.0/ml-classification-regression.html#logistic-regression
.setRegParam(0.0)
.setElasticNetParam(0.0)
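Spelled out a little more - a minimal sketch assuming a training DataFrame
with the usual label/features columns:

import org.apache.spark.ml.classification.LogisticRegression

// regParam = 0.0 disables regularization entirely; with no regularization the
// elasticNetParam (L1/L2 mixing ratio) has no effect, but it is set explicitly anyway
val lr = new LogisticRegression()
  .setRegParam(0.0)
  .setElasticNetParam(0.0)
  .setMaxIter(100)

// `training` is assumed to be a DataFrame with "label" and "features" columns
// val model = lr.fit(training)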
On Thu, 11 Oct 2018 at 15:46, pikufolgado wrote
https://issues.apache.org/jira/browse/SPARK-10943?focusedCommentId=16462797
Stephen Boesch (javadba) added a comment - 03/May/18 17:08
Assuming that the spark 2.X kernel (e.g. toree) were chosen for a given
jupyter notebook and there is a Cell 3 that contains some Spark DataFrame
operations .. Then :
- what is the relationship between the %%spark magic and the toree kernel?
- how does the %%spark magic get applied to that
(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6
2018-05-07 10:30 GMT-07:00 Stephen Boesch <java...@gmail.com>:
> I am intermittently running into guava dependency issues across multiple
> spark projects. I have tried
I am intermittently running into guava dependency issues across multiple
spark projects. I have tried maven shade / relocate but it does not
resolve the issues.
The current project is extremely simple: *no* additional dependencies
beyond scala, spark, and scalatest - yet the issues remain (and
> On Sat, 28 Apr 2018, 21:19 Stephen Boesch, <java...@gmail.com> wrote:
>
>> Do you have a machine with terabytes of RAM? afaik collect() requires
>> RAM - so that would be your limiting factor.
>>
>> 2018-04-28 8:41 GMT-07:00 klrmowse <klrmo...@gmail
Do you have a machine with terabytes of RAM? afaik collect() requires the
entire dataset to fit in driver RAM - so that would be your limiting factor.
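If driver RAM is the limiting factor, one commonly suggested workaround
(not from this thread) is to stream partitions to the driver one at a time
with toLocalIterator, or simply to write the result out to storage instead
of collecting it - a hedged sketch:

import org.apache.spark.sql.DataFrame

def processWithoutCollect(df: DataFrame): Unit = {
  // toLocalIterator only needs driver memory for the largest single partition,
  // instead of the whole dataset as collect() does
  val it = df.toLocalIterator()
  while (it.hasNext) {
    val row = it.next()
    println(row)  // handle one row at a time
  }
}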
2018-04-28 8:41 GMT-07:00 klrmowse :
> i am currently trying to find a workaround for the Spark application i am
> working on so that it does not have to use
While MLLib performed favorably vs Flink it *also *performed favorably vs
spark.ml .. and by an *order of magnitude*. The following is one of the
tables - it is for Logistic Regression. At that time spark.ML did not yet
support SVM
From: https://bdataanalytics.biomedcentral.com/articles/10.
Hi Richard, this is not a jobs board: please only discuss spark application
development issues.
2017-12-21 8:34 GMT-08:00 Richard L. Burton III :
> I'm trying to locate four independent contractors who have experience with
> Spark. I'm not sure where I can go to find
I have been testing on the 20 NewsGroups dataset - which the Spark docs
themselves reference. I can confirm that perplexity increases and
likelihood decreases as topics increase - and am similarly confused by
these results.
2017-09-28 10:50 GMT-07:00 Cody Buntain :
> Hi,
In BinaryLogisticRegressionSummary there are @Since("1.5.0") tags on a
number of comments identical to the following:
* @note This ignores instance weights (setting all to 1.0) from
`LogisticRegression.weightCol`.
* This will change in later Spark versions.
Are there any plans to address this?
Hi Mich, the github link has a brief intro - including a link to the formal
docs http://logisland.readthedocs.io/en/latest/index.html . They have an
architectural overview, developer guide, tutorial, and pretty comprehensive
api docs.
2017-10-24 13:31 GMT-07:00 Mich Talebzadeh
@Vadim Would it be true to say the `.rdd` *may* be creating a new job -
depending on whether the DataFrame/DataSet had already been materialized
via an action or checkpoint? If the only prior operations on the
DataFrame had been transformations then the dataframe would still not have
been
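A small illustration of the point (my own example, not from the thread):
caching and materializing the DataFrame first means the later `.rdd`
conversion can reuse the cached data rather than recomputing from the source:

// assuming a SparkSession `spark` is in scope; the parquet path is hypothetical
val df = spark.read.parquet("some/path")
  .filter("status = 'ok'")   // transformations only - nothing has executed yet

df.cache()
df.count()          // an action: runs a job and materializes/caches the DataFrame

val asRdd = df.rdd  // later actions on this RDD can read from the cache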
repo
- The local maven repo is included by default - so should not need to
do anything special there
The same errors from the original post continue to occur.
2017-10-11 20:05 GMT-07:00 Stephen Boesch <java...@gmail.com>:
> A clarification here: the example is being run *from
n install and
> define your local maven repo in SBT?
>
> -Paul
>
> Sent from my iPhone
>
> On Oct 11, 2017, at 5:48 PM, Stephen Boesch <java...@gmail.com> wrote:
>
> When attempting to run any example program w/ Intellij I am running into
> guava ver
When attempting to run any example program w/ Intellij I am running into
guava versioning issues:
Exception in thread "main" java.lang.NoClassDefFoundError:
com/google/common/cache/CacheLoader
at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73)
at
Cheers
> Jules
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> > On Aug 10, 2017, at 1:46 PM, Stephen Boesch <java...@gmail.com> wrote:
> >
> >
> > While the DataFrame/DataSets are useful in many circumstances they are
> cumbersome for many ty
While the DataFrame/DataSets are useful in many circumstances they are
cumbersome for many types of complex sql queries.
Is there an up to date *SQL* reference - i.e. not DataFrame DSL operations
- for version 2.2?
An example of what is not clear: what constructs are supported within
Spark SQL did not support explicit partitioners even before Tungsten: and
often enough this did hurt performance. Even now Tungsten will not do the
best job every time: so the question from the OP is still germane.
2017-06-25 19:18 GMT-07:00 Ryan :
> Why would you like to
You would need to use *native* Cassandra APIs in each Executor -
not org.apache.spark.sql.cassandra.CassandraSQLContext
- including to create a separate Cassandra connection on each Executor.
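A hedged sketch of that pattern using the DataStax Java driver - the
contact point, keyspace, table, and the shape of `rdd` are all placeholders,
and the exact driver API depends on the driver version on your classpath:

import com.datastax.driver.core.Cluster

// `rdd` is assumed to be an RDD[(String, String)] of (id, value) pairs
rdd.foreachPartition { partition =>
  // a native connection opened on the executor, not on the driver
  val cluster = Cluster.builder().addContactPoint("cassandra-host").build()
  val session = cluster.connect("my_keyspace")
  try {
    partition.foreach { case (id, value) =>
      session.execute(s"INSERT INTO my_table (id, value) VALUES ('$id', '$value')")
    }
  } finally {
    session.close()
    cluster.close()
  }
}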
2017-05-28 15:47 GMT-07:00 Abdulfattah Safa :
> So I can't run SQL queries in
Jupyter with toree works well for my team. Jupyter is much more refined than
zeppelin as far as notebook features and usability: shortcuts, editing, etc.
The caveat is that it is better to run a separate server instance for
python/pyspark vs scala/spark
2017-05-17 19:27 GMT-07:00 Richard Moorhead
Anyone have this working - either in 1.X or 2.X?
thanks
For now I have added to the log4j.properties:
log4j.logger.org.apache.parquet=ERROR
2017-02-18 11:50 GMT-08:00 Stephen Boesch <java...@gmail.com>:
> The following JIRA mentions that a fix made to read parquet 1.6.2 into 2.X
> STILL leaves an "avalanche" of
The following JIRA mentions that a fix made to read parquet 1.6.2 into
2.X STILL leaves an "avalanche" of warnings:
https://issues.apache.org/jira/browse/SPARK-17993
Here is the text inside one of the last comments before it was merged:
I have built the code from the PR and it indeed
Would it be possible to share that communication? I am interested in this
thread.
2016-12-30 11:02 GMT-08:00 Ji Yan :
> Thanks Michael, Tim and I have touched base and thankfully the issue has
> already been resolved
>
> On Fri, Dec 30, 2016 at 9:20 AM, Michael Gummelt
This problem appears to be a regression on HEAD/master: when running
against 2.0.2 the pyspark job completes successfully including running
predictions.
2016-11-23 19:36 GMT-08:00 Stephen Boesch <java...@gmail.com>:
>
> For a pyspark job with 54 executors all of the task outputs h
For a pyspark job with 54 executors all of the task outputs have a single
line in both the stderr and stdout similar to:
Error: invalid log directory /shared/sparkmaven/work/app-20161119222540-/0/
Note: the directory /shared/sparkmaven/work exists and is owned by the same
user running the
While "apparently" saturating the N available workers using your proposed N
partitions - the "actual" distribution of workers to tasks is controlled by
the scheduler. If my past experience were of service - you can *not *trust
the default Fair Scheduler to ensure the round-robin scheduling of the
What is the state of the spark-packages project(s)? When running a query
for machine learning algorithms the results are not encouraging.
https://spark-packages.org/?q=tags%3A%22Machine%20Learning%22
There are 62 packages. Only a few have actual releases - and even fewer with
dates in the past
It is private. You will need to put your code in that same package or
create an accessor to it living within that package
private[spark]
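A hedged sketch of the accessor approach for the BLAS case below: place a
tiny forwarding object inside the org.apache.spark.ml.linalg package in your
own project (private[spark] restricts access by package name, not by jar):

// file: src/main/scala/org/apache/spark/ml/linalg/BLASAccessor.scala
package org.apache.spark.ml.linalg

object BLASAccessor {
  // forward just the operations you need; BLAS.dot is private[spark] but visible here
  def dot(x: Vector, y: Vector): Double = BLAS.dot(x, y)
}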
2016-11-03 16:04 GMT-07:00 Yanwei Zhang :
> I would like to use some matrix operations in the BLAS object defined in
> ml.linalg.
You would likely want to create inline views that perform the filtering
*before* performing the cubes/rollup; in this way the cubes/rollups only operate
on the pruned rows/columns.
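For example, a hedged sketch of the shape this takes (table and column
names are invented; `spark` is assumed to be a SparkSession or SQL context
with the `sales` table registered):

val rolled = spark.sql("""
  SELECT region, product, SUM(amount) AS total
  FROM (
    -- the inline view prunes rows and columns before the cube runs
    SELECT region, product, amount
    FROM sales
    WHERE sale_date >= '2016-01-01'
  ) pruned
  GROUP BY region, product WITH CUBE
""")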
2016-11-03 11:29 GMT-07:00 Andrés Ivaldi :
> Hello, I need to perform some aggregations and a
I also did not understand why the Logging class was made private in Spark
2.0. In a couple of projects including CaffeOnSpark the Logging class was
simply copied to the new project to allow for backwards compatibility.
2016-06-28 18:10 GMT-07:00 Michael Armbrust :
> I'd
My team has a custom optimization routine that we would have wanted to plug
in as a replacement for the default LBFGS / OWLQN for use by some of the
ml/mllib algorithms.
However it seems the choice of optimizer is hard-coded in every algorithm
except LDA: and even in that one it is only a
out.write(Opcodes.REDUCE)
^
2016-06-22 23:49 GMT-07:00 Stephen Boesch <java...@gmail.com>:
> Thanks Jeff - I remember that now from long time ago. After making that
> change the next errors are:
>
> Error:scalac: missing or invalid dependency detected while lo
o
> spark/external/flume-sink/target/scala-2.11/src_managed/main/compiled_avro
> under build path, this is the only thing you need to do manually if I
> remember correctly.
>
>
>
> On Thu, Jun 23, 2016 at 2:30 PM, Stephen Boesch <java...@gmail.com> wrote:
>
>>
ang <zjf...@gmail.com>:
> It works well with me. You can try reimport it into intellij.
>
> On Thu, Jun 23, 2016 at 10:25 AM, Stephen Boesch <java...@gmail.com>
> wrote:
>
>>
>> Building inside intellij is an ever moving target. Anyone have the
>> magical procedu
Building inside intellij is an ever-moving target. Anyone have the magical
procedures to get it going for 2.X?
There are numerous library references that - although included in the
pom.xml build - are for some reason not found when processed within
Intellij.
Having looked closely at Jupyter, Zeppelin, and Spark-Notebook : only the
latter seems to be close to having support for Spark 2.X.
While I am interested in using Spark Notebook as soon as that support were
available, are there alternatives that work *now*? For example some
unmerged-yet-working
There are around twenty data generators in mllib - none of which are
presently migrated to ml.
Here is an example
/**
* :: DeveloperApi ::
* Generate sample data used for SVM. This class generates uniform random values
* for the features and adds Gaussian noise with weight 0.1 to generate
What are you expecting us to do? Yash provided a reasonable approach -
based on the info you had provided in prior emails. Otherwise you can
convert it from python to spark - or find someone else who feels
comfortable doing it. That kind of inquiry would likely be appropriate on a
job board.
How many workers (/cpu cores) are assigned to this job?
2016-06-09 13:01 GMT-07:00 SRK :
> Hi,
>
> How to insert data into 2000 partitions(directories) of ORC/parquet at a
> time using Spark SQL? It seems to be not performant when I try to insert
> 2000 directories of
Out of curiosity - are the tables partitioned on a.pk and b.fk? Hive might be using
copartitioning in that case: it is one of hive's strengths.
2016-06-09 7:28 GMT-07:00 Gourav Sengupta :
> Hi Mich,
>
> does not Hive use map-reduce? I thought it to be so. And since I am
> running
e/SPARK-7159
> On May 28, 2016 9:31 PM, "Stephen Boesch" <java...@gmail.com> wrote:
>
>> Thanks Phuong. But the point of my post is how to achieve this without using
>> the deprecated mllib package. The mllib package already has
>> multinomial regression buil
ogisticGradient is in the mllib package, not ml
> package. I just want to say that we can build a multinomial logistic
> regression model from the current version of Spark.
>
> Regards,
>
> Phuong
>
>
>
> On Sun, May 29, 2016 at 12:04 AM, Stephen Boes
gradient and loss
> for a multinomial logistic regression. That is, you can train a
> multinomial logistic regression model with LogisticGradient and a
> class to solve optimization like LBFGS to get a weight vector of the
> size (numClassrd-1)*numFeatures.
>
>
> Phuong
>
>
>
Followup: just encountered the "OneVsRest" classifier in
ml.classification: I will look into using it with the binary
LogisticRegression as the provided classifier.
2016-05-28 9:06 GMT-07:00 Stephen Boesch <java...@gmail.com>:
>
> Presently only the mllib version has t
Presently only the mllib version has the one-vs-all approach for
multinomial support. The ml version with ElasticNet support only allows
binary regression.
With feature parity of ml vs mllib having been stated as an objective for
2.0.0 - is there a projected availability of the multinomial
e PR, check Spark's API
> documentation.
>
>
> On Sun, May 15, 2016 at 9:33 AM, Stephen Boesch <java...@gmail.com> wrote:
> >
> > There is a committed PR from Marcelo Vanzin addressing that capability:
> >
> > https://github.com/apache/spark/pull/3916/files
&
There is a committed PR from Marcelo Vanzin addressing that capability:
https://github.com/apache/spark/pull/3916/files
Is there any documentation on how to use this? The PR itself has two
comments asking for the docs that were not answered.
Which Resource Manager are you using?
2016-01-20 21:38 GMT-08:00 Renu Yadav :
> Any suggestions?
>
> On Wed, Jan 20, 2016 at 6:50 PM, Renu Yadav wrote:
>
>> Hi ,
>>
>> I am facing spark task scheduling delay issue in spark 1.4.
>>
>> suppose I have 1600
Alternating least squares takes an RDD of (user/product/ratings) tuples
and the resulting Model provides predict(user, product) or predictProducts
methods among others.
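A minimal hedged sketch of that API (the mllib flavor; `sc` is an assumed
SparkContext, and the ratings file and its format are made up):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// userId,productId,rating triples
val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(",")
  Rating(user.toInt, product.toInt, rating.toDouble)
}

val model = ALS.train(ratings, /* rank = */ 10, /* iterations = */ 10, /* lambda = */ 0.01)

val score = model.predict(1, 42)            // predicted rating for user 1, product 42
val top   = model.recommendProducts(1, 5)   // top 5 recommendations for user 1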
The postgres jdbc driver needs to be added to the classpath of your spark
workers. You can do a search for how to do that (multiple ways).
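One hedged way to wire that up (connection details and the jar path are
placeholders): ship the driver jar to the workers via spark.jars (or the
equivalent --jars flag on spark-submit), then point the jdbc reader at the
postgres driver class:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pg-read")
  .config("spark.jars", "/path/to/postgresql-42.2.5.jar")  // distributed to workers
  .getOrCreate()

val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "public.my_table")
  .option("user", "me")
  .option("password", "secret")
  .load()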
2015-12-22 17:22 GMT-08:00 b2k70 :
> I see in the Spark SQL documentation that a temporary table can be created
> directly onto a
y other things that I have to do that you can think of?
>
> Thanks,
> Ben
>
>
> On Dec 22, 2015, at 6:25 PM, Stephen Boesch <java...@gmail.com> wrote:
>
> The postgres jdbc driver needs to be added to the classpath of your spark
> workers. You can do a search for how
There are solid reasons to have built spark on the jvm vs python. The
question for Daniel appears to be at this point scala vs java8. For that
there are many comparisons already available: but in the case of working
with spark there is the additional benefit for the scala side that the core
@Yu Fengdong: Your approach - specifically the groupBy - results in a
shuffle, does it not?
2015-12-04 2:02 GMT-08:00 Fengdong Yu :
> There are many ways, one simple is:
>
> such as: you want to know how many rows for each month:
>
>
>
r.t. building locally, please specify -Pscala-2.11
>
> Cheers
>
> On Tue, Nov 24, 2015 at 9:58 AM, Stephen Boesch <java...@gmail.com> wrote:
>
>> HI Madabhattula
>> Scala 2.11 requires building from source. Prebuilt binaries are
>> available only for sca
Hi Madabhattula
Scala 2.11 requires building from source. Prebuilt binaries are
available only for scala 2.10
From the src folder:
dev/change-scala-version.sh 2.11
Then build as you would normally either from mvn or sbt
The above info *is* included in the spark docs but a little hard
>> and then use Hive's dynamic partitioned insert syntax
What does this entail? Same sql but you need to do
set hive.exec.dynamic.partition = true;
in the hive/sql context (along with several other related dynamic
partition settings.)
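For concreteness, a hedged sketch of the whole sequence (table and column
names are invented; `hc` is assumed to be a HiveContext):

hc.sql("set hive.exec.dynamic.partition = true")
hc.sql("set hive.exec.dynamic.partition.mode = nonstrict")

// dynamic partitioned insert: the partition column (dt) comes last in the SELECT
hc.sql("""
  INSERT OVERWRITE TABLE events_partitioned PARTITION (dt)
  SELECT user_id, event_type, event_time, dt
  FROM events_staging
""")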
Is there anything else/special
The following works against a hive table from spark sql
hc.sql("select id,r from (select id, name, rank() over (order by name) as
r from tt2) v where v.r >= 1 and v.r <= 12")
But when using a standard sql context against a temporary table the
following occurs:
Exception in thread "main"
Checked out 1.6.0-SNAPSHOT 60 minutes ago
2015-11-18 19:19 GMT-08:00 Jack Yang <j...@uow.edu.au>:
> Which version of spark are you using?
>
>
>
> *From:* Stephen Boesch [mailto:java...@gmail.com]
> *Sent:* Thursday, 19 November 2015 2:12 PM
> *To:* user
> *Su
But to focus the attention properly: I had already tried out 1.5.2.
2015-11-18 19:46 GMT-08:00 Stephen Boesch <java...@gmail.com>:
> Checked out 1.6.0-SNAPSHOT 60 minutes ago
>
> 2015-11-18 19:19 GMT-08:00 Jack Yang <j...@uow.edu.au>:
>
>> Which version of spark ar
Why is the same query (and actually I tried several variations) working
against a HiveContext and not against the SQLContext?
2015-11-18 19:57 GMT-08:00 Michael Armbrust <mich...@databricks.com>:
> Yes they do.
>
> On Wed, Nov 18, 2015 at 7:49 PM, Stephen Boesch <java..
ust the last thing that happens to
> fail.
>
> On Sun, Oct 4, 2015 at 7:06 AM, Stephen Boesch <java...@gmail.com> wrote:
> >
> > For a week or two the trunk has not been building for the examples module
> > within intellij. The other modules - including core, sql,
For a week or two the trunk has not been building for the examples module
within intellij. The other modules - including core, sql, mllib, etc *are *
working.
A portion of the error message is
"Unable to get dependency information: Unable to read the metadata file for
artifact
Hi Michel, please try local[1] and report back whether the breakpoint is hit.
2015-09-18 7:37 GMT-07:00 Michel Lemay :
> Hi,
>
> I'm adding unit tests to some utility functions that are using
> SparkContext but I'm unable to debug code and hit breakpoints when running
> under
Yes, adding that flag does the trick. thanks.
2015-09-10 13:47 GMT-07:00 Sean Owen <so...@cloudera.com>:
> -Dtest=none ?
>
>
> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-RunningIndividualTests
>
> On Thu, Sep 10, 2015 at
I have invoked mvn test with the -DwildcardSuites option to specify a
single BinarizerSuite scalatest suite.
The command line is
mvn -pl mllib -Pyarn -Phadoop-2.6 -Dhadoop2.7.1 -Dscala-2.11
-Dmaven.javadoc.skip=true
-DwildcardSuites=org.apache.spark.ml.feature.BinarizerSuite test
The scala
.
FYI
On Sun, Aug 16, 2015 at 11:12 AM, Stephen Boesch java...@gmail.com
wrote:
I am building spark with the following options - most notably the
**scala-2.11**:
. dev/switch-to-scala-2.11.sh
mvn -Phive -Pyarn -Phadoop-2.6 -Dhadoop2.6.2 -Pscala-2.11 -DskipTests
-Dmaven.javadoc.skip
I am building spark with the following options - most notably the
**scala-2.11**:
. dev/switch-to-scala-2.11.sh
mvn -Phive -Pyarn -Phadoop-2.6 -Dhadoop2.6.2 -Pscala-2.11 -DskipTests
-Dmaven.javadoc.skip=true clean package
The build goes pretty far but fails in one of the minor modules
The NoClassDefFoundError differs from ClassNotFoundException: it
indicates an error while initializing that class, even though the class is found
on the classpath. Please provide the full stack trace.
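A small self-contained illustration of the difference (my own toy example,
unrelated to the original poster's code): the class is on the classpath, but
its static initialization fails, and later references surface as
NoClassDefFoundError:

object Config {
  // the object's static initialization always fails in this toy example
  val required: String = throw new IllegalStateException("missing required setting")
}

object Demo extends App {
  // first access: ExceptionInInitializerError (wrapping the IllegalStateException)
  try Config.required catch { case t: Throwable => println(t) }
  // second access: NoClassDefFoundError - the class is present on the classpath,
  // but its initialization previously failed
  try Config.required catch { case t: Throwable => println(t) }
}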
2015-08-14 4:59 GMT-07:00 stelsavva stel...@avocarrot.com:
Hello, I am just starting out with
Given the following command line to spark-submit:
bin/spark-submit --verbose --master local[2] --class
org.yardstick.spark.SparkCoreRDDBenchmark
/shared/ysgood/target/yardstick-spark-uber-0.0.1.jar
Here is the output:
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes
when using spark-submit: which directory contains third party libraries
that will be loaded on each of the slaves? I would like to scp one or more
libraries to each of the slaves instead of shipping the contents in the
application uber-jar.
Note: I did try adding to $SPARK_HOME/lib_managed/jars.
One option is the databricks/spark-perf project
https://github.com/databricks/spark-perf
2015-07-08 11:23 GMT-07:00 MrAsanjar . afsan...@gmail.com:
Hi all,
What is the most common used tool/product to benchmark spark job?
The following errors are occurring upon building using mvn options clean
package
Are there some requirements/restrictions on profiles/settings for catalyst
to build properly?
[error]
/shared/sparkup2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala:138:
value
Vanilla map/reduce does not expose it: but hive on top of map/reduce has
superior partitioning (and bucketing) support to Spark.
2015-06-28 13:44 GMT-07:00 Koert Kuipers ko...@tresata.com:
spark is partitioner aware, so it can exploit a situation where 2 datasets
are partitioned the same way
Oryx 2 has a scala client
https://github.com/OryxProject/oryx/blob/master/framework/oryx-api/src/main/scala/com/cloudera/oryx/api/
2015-06-20 11:39 GMT-07:00 Debasish Das debasish.da...@gmail.com:
After getting used to Scala, writing Java is too much work :-)
I am looking for scala based
I downloaded the 1.3.1 distro tarball
$ll ../spark-1.3.1.tar.gz
-rw-r--r--@ 1 steve staff 8500861 Apr 23 09:58 ../spark-1.3.1.tar.gz
However the build on it is failing with an unresolved dependency:
*configuration
not public*
$ build/sbt assembly -Dhadoop.version=2.5.2 -Pyarn -Phadoop-2.4
(the same btw applies for the Node where you run the driver app – all
other nodes must be able to resolve its name)
*From:* Stephen Boesch [mailto:java...@gmail.com]
*Sent:* Wednesday, May 20, 2015 10:07 AM
*To:* user
*Subject:* Intermittent difficulties for Worker to contact Master on
same
TestRunner: power-iteration-clustering   8   512.0 MB   2015/05/27 12:44:03   steve   FINISHED   6 s
app-20150527123822-   TestRunner: power-iteration-clustering   8   512.0 MB   2015/05/27 12:38:22   steve   FINISHED   6 s
2015-05-27 11:42 GMT-07:00 Stephen Boesch java...@gmail.com:
Thanks Yana,
My current
What conditions would cause the following delays / failure for a standalone
machine/cluster to have the Worker contact the Master?
15/05/20 02:02:53 INFO WorkerWebUI: Started WorkerWebUI at
http://10.0.0.3:8081
15/05/20 02:02:53 INFO Worker: Connecting to master
Hi Ricardo,
providing the error output would help. But in any case you need to do a
collect() on the rdd returned from computeCost.
2015-05-19 11:59 GMT-07:00 Ricardo Goncalves da Silva
ricardog.si...@telefonica.com:
Hi,
Can anybody see what’s wrong in this piece of code:
Hi Akhil, Building with sbt tends to need around 3.5GB whereas maven
requirements are much lower, around 1.7GB. So try using maven.
For reference I have the following settings and both do compile. sbt would
not work with lower values.
$echo $SBT_OPTS
-Xmx3012m -XX:MaxPermSize=512m