We should change the trait to an abstract class, and then your problem will go
away.
Do you want to submit a pull request?
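For illustration, a hedged sketch of the suggested change; the members below
are simplified stand-ins, not the exact PersistenceEngine API:

    // A Scala trait with concrete members compiles to an interface plus
    // synthetic helper classes, which is awkward to implement from Java.
    // An abstract class maps directly onto a Java abstract class, so a
    // Java implementation can simply use extends.
    abstract class PersistenceEngine {
      def persist(name: String, obj: AnyRef): Unit
      def unpersist(name: String): Unit
      def read[T](prefix: String): Seq[T]
    }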
On Wed, Apr 29, 2015 at 11:02 PM, Niranda Perera
wrote:
Hi,
this is a follow-up to the feature described in [1]
I'm trying to implement a custom persistence engine and a leader agent in
the Java environment.
Vis-à-vis Scala, when I implement the PersistenceEngine trait in Java, I
would have to implement methods such as readPersistedData, removeDri
I have the real DEBS-Taxi data in a CSV file. In order to operate over it,
how do I simulate a "Spout"-like event generator using the timestamps in the
CSV file?
--
Thanks & Regards,
Anshu Shukla
SERC-IISC
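One way to approach Anshu's question, sketched under the assumption that the
first CSV column holds an epoch-millisecond timestamp (the actual DEBS taxi
layout may differ):

    import scala.io.Source

    // Replay a timestamped CSV as an event stream: sort rows by timestamp,
    // then sleep for the gap between consecutive timestamps before emitting
    // each row, e.g. replay("taxi.csv")(row => println(row.mkString(",")))
    def replay(path: String)(emit: Array[String] => Unit): Unit = {
      val rows = Source.fromFile(path).getLines()
        .map(_.split(","))
        .toIndexedSeq
        .sortBy(_(0).toLong)              // assumption: epoch ms in column 0
      for (i <- rows.indices) {
        emit(rows(i))
        if (i + 1 < rows.length)          // pace by the timestamp gap
          Thread.sleep(math.max(0L, rows(i + 1)(0).toLong - rows(i)(0).toLong))
      }
    }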
We definitely still have the name collision problem in SQL.
On Wed, Apr 29, 2015 at 10:01 PM, Punyashloka Biswal wrote:
Do we still have to keep the names of the functions distinct to avoid
collisions in SQL? Or is there a plan to allow "importing" a namespace into
SQL somehow?
I ask because if we have to keep worrying about name collisions then I'm
not sure what the added complexity of #2 and #3 buys us.
Punya
On
I am working on it. Here is the (very rough) version:
https://github.com/apache/spark/compare/apache:master...marmbrus:multiHiveVersions
On Mon, Apr 27, 2015 at 1:03 PM, Punyashloka Biswal
wrote:
> Thanks Marcelo and Patrick - I don't know how I missed that ticket in my
> Jira search earlier. I
Actually I'm doing some cleanups related to type coercion, and I will take
care of this.
On Wed, Apr 29, 2015 at 5:10 PM, lonely Feb wrote:
OK, I'll try.
On Apr 30, 2015 06:54, "Reynold Xin" wrote:
I agree, Ewan.
We should also look into combining Flink and Spark into one; this would ease
industry adoption.
Thanking you.
With Regards
Sree
On Wednesday, April 29, 2015 3:21 AM, Ewan Higgs
wrote:
We added ExpectedInputConversion rule recently in analysis:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala#L647
With this rule, the analyzer automatically adds cast for expressions that
inherit ExpectsInputTypes
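To make the mechanism concrete, here is a toy, self-contained analogue of that
rule (illustrative only, not the actual Catalyst code): expressions declare
the input types they expect, and an analysis pass wraps mismatching children
in casts.

    sealed trait DataType
    case object StringType extends DataType
    case object LongType extends DataType
    case object BooleanType extends DataType

    sealed trait Expr { def dataType: DataType }
    case class Attr(name: String, dataType: DataType) extends Expr
    case class Cast(child: Expr, dataType: DataType) extends Expr

    // Toy analogue of ExpectsInputTypes: a regexp match expects two strings.
    case class RegExpMatch(left: Expr, right: Expr) extends Expr {
      val dataType: DataType = BooleanType
      val expectedInputTypes: Seq[DataType] = Seq(StringType, StringType)
    }

    // The "rule": insert a Cast wherever a child's type doesn't match, e.g.
    // coerce(RegExpMatch(Attr("id", LongType), Attr("pat", StringType)))
    // rewrites the left child to Cast(Attr("id", LongType), StringType).
    def coerce(e: Expr): Expr = e match {
      case r @ RegExpMatch(l, rgt) =>
        def fix(c: Expr, t: DataType) = if (c.dataType == t) c else Cast(c, t)
        val Seq(lt, rt) = r.expectedInputTypes
        RegExpMatch(fix(l, lt), fix(rgt, rt))
      case other => other
    }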
Scaladoc isn't much of a problem because scaladocs are grouped. Java/Python
is the main problem ...
See
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
On Wed, Apr 29, 2015 at 3:38 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:
My feeling is that we should have a handful of namespaces (say 4 or 5). It
becomes too cumbersome to import / remember more package names, and having
everything in one package makes it hard to read the scaladoc, etc.
Thanks
Shivaram
On Wed, Apr 29, 2015 at 3:30 PM, Reynold Xin wrote:
To add a little bit more context, some pros/cons I can think of are:
Option 1: Very easy for users to find the function, since they are all in
org.apache.spark.sql.functions. However, there will be quite a large number
of them.
Option 2: I can't tell why we would want this one over Option 3, sinc
Before we make DataFrame non-alpha, it would be great to decide how we want
to namespace all the functions. There are 3 alternatives:
1. Put all in org.apache.spark.sql.functions. This is how SQL does it,
since SQL doesn't have namespaces. I estimate eventually we will have ~ 200
functions.
2. Ha
Spark currently doesn't allocate any memory off of the heap for shuffle
objects. When the in-memory data gets too large, it will write it out to a
file, and then merge the spilled files later.
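A minimal, self-contained sketch of that spill-and-merge pattern (illustrative
only, not Spark's actual shuffle code; the threshold is a record count rather
than a byte size for brevity):

    import java.io.{File, PrintWriter}
    import scala.collection.mutable.ArrayBuffer
    import scala.io.Source

    class Spillable(maxInMemory: Int = 1000) {
      private val buffer = ArrayBuffer.empty[Int]
      private val spills = ArrayBuffer.empty[File]

      def insert(record: Int): Unit = {
        buffer += record
        if (buffer.size >= maxInMemory) spill()    // too large: go to disk
      }

      private def spill(): Unit = {
        val f = File.createTempFile("spill", ".txt")
        val out = new PrintWriter(f)
        buffer.sorted.foreach(n => out.println(n)) // each spill is a sorted run
        out.close()
        spills += f
        buffer.clear()
      }

      // Merge the spilled runs with whatever is still in memory. A real
      // implementation streams a k-way merge instead of re-sorting everything.
      def result(): Seq[Int] =
        (spills.flatMap(f => Source.fromFile(f).getLines().map(_.toInt))
          ++ buffer).sorted.toSeq
    }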
What exactly do you mean by store shuffle data in HDFS?
-Sandy
On Tue, Apr 14, 2015 at 10:15 AM, Kannan Ra
I guess you can use cast(id as String) instead of just id in your where
clause?
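For instance, a sketch of the rewritten query (sqlContext and the table and
column names are taken from the quoted question):

    // Casting id to a string before the regexp match avoids comparing a
    // Long column against a String pattern.
    sqlContext.sql("select A from B where cast(id as string) regexp '^12345$'")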
On Wed, Apr 29, 2015 at 12:13 PM, lonely Feb wrote:
To give you a broader idea of the current use case: I have a few
transformations (sorts and column creations) oriented towards a simple goal.
My data is timestamped, and if two lines are identical, the time difference
between them has to be more than X days for them to be kept, so there are a
few shifts do
In general there's a tension between ordered data and the set-oriented data
model underlying DataFrames. You can force a total ordering on the data,
but it may come at a high cost with respect to performance.
It would be good to get a sense of the use case you're trying to support,
but one suggestion
In this case it's fine to discuss whether this would fit in Spark
DataFrames' high level direction before putting it in JIRA. Otherwise we
might end up creating a lot of tickets just for querying whether something
might be a good idea.
About this specific feature -- I'm not sure what it means in g
I can't comment on the direction of the DataFrame API (that's more for
Reynold or Michael I guess), but I just wanted to point out that the JIRA
would be the recommended way to create a central place for discussing a
feature add like that.
Nick
On Wed, Apr 29, 2015 at 3:43 PM Olivier Girardot <
o
Hi Nicholas,
yes I've already checked, and I've just created the
https://issues.apache.org/jira/browse/SPARK-7247
I'm not even sure why this would be a good feature to add, except that some
of the data scientists I'm working with are using it, and it would therefore
be useful for me to tran
You can check JIRA for any existing plans. If there isn't any, then feel
free to create a JIRA and make the case there for why this would be a good
feature to add.
Nick
On Wed, Apr 29, 2015 at 7:30 AM Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:
> Hi,
> Is there any plan to add the
Hi Reynold,
It took me some time, but I've finally found that there is a difference
between spilling on the map-side and spilling on the reduce-side for a
shuffle. Spilling to disk on the map-side happens by default (with the
spillToPartitionFiles call from insertAll in ExternalSorter; don't know
Hi Juan, Daniel,
thank you for your explanations. Indeed, I don't have a big number of keys,
at least not enough to get the scheduler stuck.
I was using a method quite similar to what you posted, Juan, and yes it
works, but I think it would be more efficient not to call filter on each
key. So, I was
Hi Daniel,
I understood Sébastien was talking about having a high number of keys; I
guess I was prejudiced by my own problem! :) Anyway, I don't think you need
to use disk or a database to generate an RDD per key, you can use filter,
which I guess would be more efficient because IO is avoided, espec
Check out http://stackoverflow.com/a/26051042/3318517. It's a nice method
for saving the RDD into separate files by key in a single pass. Then you
can read the files into separate RDDs.
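For reference, a hedged sketch of the filter-per-key approach being discussed;
it is reasonable when the number of distinct keys is small, and each filter
re-scans the parent RDD unless it is cached:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Build one RDD per distinct key by filtering the parent RDD.
    def splitByKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): Map[K, RDD[V]] = {
      rdd.cache()                                 // avoid recomputing per key
      val keys = rdd.keys.distinct().collect()
      keys.map(k => k -> rdd.filter(_._1 == k).values).toMap
    }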
On Wed, Apr 29, 2015 at 2:10 PM, Juan Rodríguez Hortalá <
juan.rodriguez.hort...@gmail.com> wrote:
Hi Sébastien,
I came with a similar problem some time ago, you can see the discussion in
the Spark users mailing list at
http://markmail.org/message/fudmem4yy63p62ar#query:+page:1+mid:qv4gw6czf6lb6hpq+state:results
. My experience was that when you create too many RDDs the Spark scheduler
gets stu
Hi,
Is there any plan to add the "shift" method from Pandas to Spark Dataframe,
not that I think it's an easy task...
c.f.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html
Regards,
Olivier.
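For what it's worth, a hedged sketch of how shift(1) could be approximated
with a window function; window functions landed in Spark SQL only after this
thread (in 1.4), and the timestamp/value column names are assumptions:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.lag

    // lag pulls each row's value from the previous row under an explicit
    // ordering; a total ordering is required, which is the performance
    // tension mentioned elsewhere in this thread.
    def shifted(df: DataFrame): DataFrame =
      df.withColumn("value_shifted",
        lag("value", 1).over(Window.orderBy("timestamp")))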
No, I am not collecting the result to the driver; I simply send the result to
Kafka.
BTW, the image addresses are:
https://cloud.githubusercontent.com/assets/5170878/7389463/ac03bf34-eea0-11e4-9e6b-1d2fba170c1c.png
and
https://cloud.githubusercontent.com/assets/5170878/7389480/c629d236-eea0-11e4-983a-dc5
Hello,
I'm facing a problem with custom RDD transformations.
I would like to transform an RDD[K, V] into a Map[K, RDD[V]], meaning a map
of RDDs by key.
This would be great, for example, in order to process mllib clustering on V
values grouped by K.
I know I could do it using filter() on my RDD a
Hi, dear developers, I am using Spark Streaming to read data from Kafka. The
program has already run for about 120 hours, but today it failed because of a
driver OOM, as follows:
Container [pid=49133,containerID=container_1429773909253_0050_02_01] is
running beyond physical memory limits. Cu
Hi all,
A quick question about Tungsten. The announcement of the Tungsten
project is on the back of Hadoop Summit in Brussels where some of the
Flink devs were giving talks [1] on how Flink manages memory using byte
arrays and the like to avoid the overhead of all the Java types [2]. Is
there a
The problem was in my hive-13 branch. So ignore this.
On 27 April 2015 at 10:34, Manku Timma wrote:
> I am facing an exception "Hive.get() called without a hive db setup" in
> the executor. I wanted to understand how the Hive object is initialized in
> the executor threads? I only see Hive.get(hivec
Hi all, we are migrating our Hive jobs to Spark SQL, but we found a little
difference between Hive and Spark SQL: our SQL has a statement like:
select A from B where id regexp '^12345$'
In Hive it works fine, but in Spark SQL we got a:
java.lang.ClassCastException: java.lang.Long cannot be cas