I suggest searching the archives for this list as there were several
previous discussions about this problem. JIRA also has several issues
related to this.
Some pointers:
- SPARK-3431 https://issues.apache.org/jira/browse/SPARK-3431:
Parallelize Scala/Java test execution
-
Hi all,
I am using Spark 1.3.1 to write a Spectral Clustering algorithm. This really
confused me today. At first I thought my implementation was wrong. It turns
out it's an issue in MLlib. Fortunately, I've figured it out.
I suggest adding a hint to the MLlib user documentation (as far as I know,
Hi Hrishikesh,
It seems the behavior of kafka-assembly is a little different when using Maven
versus sbt. The assembly jar name and location are different when using `mvn
package`. This is actually a bug; I'm fixing it now.
Thanks
Jerry
2015-04-22 13:37 GMT+08:00 Hrishikesh Subramonian
It could very well be that your executor memory is not enough to store the
state RDDs AND operate on the data. 1G per executor is quite low.
Definitely give more memory. And have you tried increasing the number of
partitions (specify number of partitions in updateStateByKey) ?
On Wed, Apr 22,
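For illustration, a minimal sketch of the kind of update function involved, with the Spark-specific call shown only in a comment (the stream name `events`, the state type, and the partition count are assumptions, not from this thread):

```scala
// State update function for updateStateByKey: add this batch's values to the
// running count carried in `state`.
val updateCount = (values: Seq[Long], state: Option[Long]) =>
  Some(values.sum + state.getOrElse(0L))

// With a StreamingContext in scope, the partition count goes in as the second
// argument, spreading the state RDDs over more, smaller tasks (hypothetical
// DStream[(String, Long)] named `events`):
//   events.updateStateByKey(updateCount, 200)

println(updateCount(Seq(1L, 2L), Some(3L)))  // Some(6)
```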
It is actually different.
The coalesce expression picks the first value that is not null:
https://msdn.microsoft.com/en-us/library/ms190349.aspx
It would be great to update the documentation for it (both Scala and Java) to
explain that it is different from the coalesce method on a DataFrame/RDD. Do
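To make the distinction concrete, here is a plain-Scala model of the expression's first-non-null semantics, with the two Spark usages sketched in comments (the DataFrame `df` and its column name are assumptions):

```scala
// Model of the SQL coalesce *expression*: return the first non-null value.
def firstNonNull[A](values: Option[A]*): Option[A] = values.flatten.headOption

// In Spark SQL the expression form fills nulls per row:
//   import org.apache.spark.sql.functions.{coalesce, lit}
//   df.select(coalesce(df("a"), lit(0.0)))
// whereas the DataFrame/RDD *method* df.coalesce(n) only reduces the number
// of partitions and never touches the values.

println(firstNonNull(None, Some(1), Some(2)))  // Some(1)
```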
Anecdotally, there are a number of people asking to set the Assignee
field. This is currently restricted to Committers in JIRA. I know the
logic was to prevent people from Assigning a JIRA and then leaving it;
it also matters a bit for questions of credit.
Still I wonder if it's best to just let
One overarching issue is that it's pretty unclear what "Assigned to
X" in JIRA means from a process perspective. Personally, I actually
feel it's better for this to be more historical - i.e. who ended up
submitting a patch for this feature that was merged - rather than
creating an exclusive
Hi all,
It appears GradientBoostedTrees.scala can call 'persist' on an RDD and never
unpersist it. In the master branch it's here:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala#L181
In 1.3.1 it's here:
Hi Jim,
You're right; that should be unpersisted. Could you please create a JIRA
and submit a patch?
Thanks!
Joseph
On Wed, Apr 22, 2015 at 6:00 PM, jimfcarroll jimfcarr...@gmail.com wrote:
Hi all,
It appears GradientBoostedTrees.scala can call 'persist' on an RDD and
never
unpersist it.
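As a side note, the usual shape of the fix is to tie the unpersist to the end of the computation. A generic plain-Scala sketch of that pattern (the helper name and the Spark lines in the comment are illustrative, not from the thread):

```scala
// Run `body` on a cached value and guarantee the release step runs afterwards,
// even if the body throws.
def withCached[A, B](acquire: => A)(release: A => Unit)(body: A => B): B = {
  val cached = acquire
  try body(cached) finally release(cached)
}

// Applied to the RDD case (Spark imports assumed in scope):
//   withCached(input.persist(StorageLevel.MEMORY_AND_DISK))(_.unpersist()) { cached =>
//     // ... boosting iterations over `cached` ...
//   }

println(withCached(41)(_ => ())(_ + 1))  // 42
```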
Hi Chunnan,
There is currently Scala documentation for the constructor parameters:
https://github.com/apache/spark/blob/04525c077c638a7e615c294ba988e35036554f5f/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala#L515
There is one benefit to not checking for validity (ordering)
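For context, a plain-Scala model of the invariant the constructor relies on but does not verify (the MLlib call in the comment is a sketch; the exact checks are discussed at the linked Vectors.scala):

```scala
// A sparse vector's indices must lie in [0, size) and be strictly increasing;
// skipping this check in the constructor avoids an extra O(nnz) pass.
def hasValidIndices(size: Int, indices: Array[Int]): Boolean =
  indices.forall(i => i >= 0 && i < size) &&
    indices.zip(indices.drop(1)).forall { case (a, b) => a < b }

// Construction in MLlib (indices sorted, matching values):
//   import org.apache.spark.mllib.linalg.Vectors
//   val v = Vectors.sparse(5, Array(1, 3), Array(0.5, 2.0))

println(hasValidIndices(5, Array(1, 3)))  // true
```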
To repeat what Patrick said (literally):
If an issue is “assigned” to person X, but some other person Y submits
a great patch for it, I think we have some obligation to Spark users
and to the community to merge the better patch. So the idea of
reserving the right to add a feature, it just seems
Last one for the day.
Everyone, as I said clearly, I was not alluding to anything fishy in
practice; I was describing how things go wrong in such an environment. Sandy's
email lays out some of these problems.
Assigning a JIRA in other projects is not a reservation. It is a clear
intention
I watch these lists, so I have a fair understanding of how things work around
here. I don't give direct input in the day to day activities though, like Greg
Stein on the other thread, so I can understand if it looks like it came from up
above. Apache Members come around and give opinions time
Sandy - I definitely agree with that. We should have a convention for
signaling that someone intends to work on an issue - for instance, by
commenting on the JIRA - and we should document this in the contribution
guide. The nice thing about having that convention is that multiple people
can say they are going to work
I think one of the benefits of assignee fields that I've seen in other
projects is their potential to coordinate and prevent duplicate work. It's
really frustrating to put a lot of work into a patch and then find out that
someone else has been doing the same work. It's helpful for project etiquette to
I can get behind that point of view too. That's what I've told people
who expect Assignee is a necessary part of workflow. The existence of
a PR link is a signal someone's working on it.
In that case we need not do anything.
On Wed, Apr 22, 2015 at 8:32 PM, Patrick Wendell pwend...@gmail.com
Actually, what this community got away with is pretty much an anti-pattern
compared to every other Apache project I have seen. And, may I say, in a
not-so-Apache way.
Waiting for a committer to assign a patch to someone leaves it as a privilege
to a committer. Not alluding to anything fishy in
Whoa, hold on a minute.
Spark has been among the projects that are the most welcoming to new
contributors. And thanks to this, the volume of activity in Spark
is much larger than in other projects, and our workflow has to accommodate
this fact.
In practice, people just create pull requests on
Hi Vinod,
Thanks for your thoughts. However, I do not agree with your sentiment
and implications. Spark is broadly quite an inclusive project and we
spend a lot of effort culturally to help make newcomers feel welcome.
- Patrick
On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar Vavilapalli
Agreed. The Spark project and community that Vinod describes do not
resemble the ones with which I am familiar.
On Wed, Apr 22, 2015 at 1:20 PM, Patrick Wendell pwend...@gmail.com wrote:
Hi Vinod,
Thanks for your thoughts. However, I do not agree with your sentiment
and implications. Spark
I think you misread the thread, since that's the opposite of what
Patrick suggested. He's suggesting that *nobody ever waits* to be
assigned a JIRA to work on it; that anyone may work on a JIRA without
waiting for it to be assigned.
The point is: assigning JIRAs discourages others from doing work
If what you say is true, what is the reason for this
committer-only-assigns-JIRA-tickets policy? If anyone can send a pull
request, anyone should be able to assign tickets to himself/herself too.
+Vinod
On Apr 22, 2015, at 1:18 PM, Reynold Xin
r...@databricks.com
As a contributor, I've never felt shut out from the Spark community, nor
have I seen any examples of territorial behavior. A few times I've
expressed interest in more challenging work and the response I received
was generally "go ahead and give it a shot, just understand that this is
sensitive
There were some PRs about graphical representation with D3.js; you can see
them on GitHub. Here are a few of them:
https://github.com/apache/spark/pulls?utf8=%E2%9C%93&q=d3
Thanks
Best Regards
On Wed, Apr 22, 2015 at 8:08 AM, Punyashloka Biswal punya.bis...@gmail.com
wrote:
Dear
Hi,
Looks interesting.
It would be interesting to know the reason for not showing these stats in
the UI.
Patrick W's description in
https://spark-project.atlassian.net/browse/SPARK-999 does not mention
any exception w.r.t. failed tasks/executors.
Can
Thanks for the pointers! It looks like others are pretty active on this so
I'll comment on those PRs and try to coordinate before starting any new
work.
Punya
On Wed, Apr 22, 2015 at 2:49 AM Akhil Das ak...@sigmoidanalytics.com
wrote:
There were some PR's about graphical representation with
Anyone?
On Wed, Apr 22, 2015 at 12:29 PM, Sourav Chandra
sourav.chan...@livestream.com wrote:
Hi Olivier,
The update function is as below:

val updateFunc = (values: Seq[IConcurrentUsers], state: Option[(Long, Long)]) => {
  val previousCount = state.getOrElse((0L, 0L))._2
Where should this *coalesce* come from? Is it related to the partition
manipulation coalesce method?
Thanks!
On Mon, Apr 20, 2015 at 22:48, Reynold Xin r...@databricks.com wrote:
Ah, I see. You can do something like
df.select(coalesce(df("a"), lit(0.0)))
On Mon, Apr 20, 2015 at 1:44 PM,
I think I found the Coalesce you were talking about, but it is a Catalyst
class that, I think, is not available from PySpark.
Regards,
Olivier.
On Wed, Apr 22, 2015 at 11:56, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
Where should this *coalesce* come from? Is it related to
"Replacement for production-ish" is beyond-a-stretch phrasing; the UX just
isn't there yet for the average end user wanting push-button operation.
Until recently the focus was heavily on infrastructure folks and people
building their own distros. The project is turning towards end users, so anyone
from ops to
I agree; that's what I did :)
I was just wondering whether this is considered a problem or something to work
on (I personally think so, because the feedback loop should be as quick as
possible) and, if so, whether there is someone I could help.
On Tue, Apr 21, 2015 at 22:20, Reynold Xin r...@databricks.com
Hi,
I came across documentation for creating a pipeline in the MLlib library of
PySpark. I wanted to know if something similar exists for PySpark input
transformations. I have a use case where I have my input files in different
formats and would like to convert them to RDDs and store them in memory and