Matt Goodman wrote
I would tentatively also suggest conda packaging.
http://conda.pydata.org/docs/
$ conda skeleton pypi pyspark
# update git_tag and git_uri
# add test commands (import pyspark; import pyspark.[...])
Docs for building conda packages for multiple operating systems and
The bulk load code is in HBASE-14150 if you are interested. Let me know how it can be
made faster.
It's just a Spark shuffle and writing HFiles. Unless Astro wrote its own
shuffle, the times should be very close.
On Aug 11, 2015 8:49 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:
Ted,
Thanks for
Finally I can take a look at HBASE-14181 now. Unfortunately there is no design
doc mentioned. Superficially it is very similar to Astro, with the difference that
this is part of the HBase client library, while Astro works as a Spark package
and so will evolve and function more closely with Spark
Yan:
Where can I find performance numbers for Astro (it's close to the middle of
August)?
Cheers
On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:
Ted,
I’m in China now, and seem to be having difficulty accessing Apache Jira.
Anyway, it appears to me that
HBASE-14181 (https://issues.apache.org/jira/browse/HBASE-14181) attempts to
support Spark DataFrames inside HBase.
If true, one question to me is whether HBase is intended to have a built-in query engine.
OK. Then a question will be how to define the boundary between a query engine and
built-in processing. If, for instance, the Spark DataFrame functionalities
involving shuffling are to be supported inside HBase,
in my opinion it’d be hard not to tag it as a query engine. If, on the other
hand,
This is now done with this pull request:
https://github.com/apache/spark/pull/8091
Committers please update the script to get this feature.
On Mon, Jul 20, 2015 at 12:28 AM, Manoj Kumar
manojkumarsivaraj...@gmail.com wrote:
+1
Sounds like a great idea.
On Sun, Jul 19, 2015 at 10:54 PM,
HBase will not have a query engine.
It will provide better support to query engines.
Cheers
On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:
Ted,
I’m in China now, and seem to be having difficulty accessing Apache Jira.
Anyway, it appears to me that
Hi Starch,
It also depends on the application's behavior; some might not be able to
utilize the network properly. If you are using, say, Kafka, then one thing
that you should keep in mind is the size of the individual messages and the
number of partitions that you have. The higher the message
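The trade-off described above can be sketched with a back-of-the-envelope model (plain Python; the numbers and the simple linear throughput model are illustrative assumptions, not measurements): in a consumer group, a partition is read by at most one consumer, so the partition count caps parallelism.

```python
# Rough throughput model for a Kafka-style pipeline (illustrative only).

def effective_parallelism(num_consumers: int, num_partitions: int) -> int:
    # A partition is consumed by at most one consumer in a group,
    # so consumers beyond the partition count sit idle.
    return min(num_consumers, num_partitions)

def est_throughput_mb_s(msg_size_kb: float, msgs_per_sec_per_consumer: float,
                        num_consumers: int, num_partitions: int) -> float:
    workers = effective_parallelism(num_consumers, num_partitions)
    return workers * msgs_per_sec_per_consumer * msg_size_kb / 1024.0

# With 8 consumers but only 4 partitions, only 4 consumers do work:
print(effective_parallelism(8, 4))          # 4
print(est_throughput_mb_s(64, 1000, 8, 4))  # 4 * 1000 * 64 / 1024 = 250.0
```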
You can create a new issue and send a pull request for it, I think.
+ dev list
Thanks
Best Regards
On Tue, Aug 11, 2015 at 8:32 AM, Hyukjin Kwon gurwls...@gmail.com wrote:
Dear Sir / Madam,
I have a plan to contribute some code for passing filters to a
data source as physical
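The general idea being proposed (letting a source report which predicates it can evaluate, with the rest left for the engine to apply) can be sketched in plain Python. The filter classes and the "supported" set below are hypothetical stand-ins, not Spark's actual API:

```python
# Sketch of filter pushdown: the data source declares which predicate
# types it can handle; the rest stay with the engine as a residual.
from dataclasses import dataclass

@dataclass(frozen=True)
class EqualTo:
    attr: str
    value: object

@dataclass(frozen=True)
class GreaterThan:
    attr: str
    value: object

@dataclass(frozen=True)
class StringContains:
    attr: str
    value: str

SOURCE_SUPPORTED = (EqualTo, GreaterThan)  # what this source can push down

def split_filters(filters):
    pushed = [f for f in filters if isinstance(f, SOURCE_SUPPORTED)]
    residual = [f for f in filters if not isinstance(f, SOURCE_SUPPORTED)]
    return pushed, residual

pushed, residual = split_filters(
    [EqualTo("id", 7), StringContains("name", "foo"), GreaterThan("age", 21)]
)
print(len(pushed), len(residual))  # 2 1
```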
As I understand it, OutputCommitCoordinator should only be necessary for
ResultStages (especially a ResultStage with an HDFS write), but currently it
is used for all stages. Is there any reason for that?
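For context, the coordination the question refers to boils down to this idea (a minimal sketch, not Spark's actual implementation): for each (stage, partition), authorize exactly one task attempt to commit its output, so speculative or retried duplicates cannot both commit.

```python
# Minimal sketch of output-commit coordination: first attempt to ask
# for a given (stage, partition) wins; other attempts are refused.
class CommitCoordinator:
    def __init__(self):
        self._winners = {}  # (stage, partition) -> winning attempt id

    def can_commit(self, stage: int, partition: int, attempt: int) -> bool:
        key = (stage, partition)
        if key not in self._winners:
            self._winners[key] = attempt
            return True
        return self._winners[key] == attempt

coord = CommitCoordinator()
print(coord.can_commit(1, 0, attempt=0))  # True: first attempt wins
print(coord.can_commit(1, 0, attempt=1))  # False: a duplicate attempt
```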
--
Best Regards
Jeff Zhang
Hi,
Recently, I used Spark SQL to do a join on a non-equality condition,
for example condition1 OR condition2.
Spark will use BroadcastNestedLoopJoin to do this. Assume that one of the
DataFrames (df1) is not created from a Hive table nor a local collection, and the
other one is created from a Hive table (df2). For
Hi
My Spark job (running in local[*] with Spark 1.4.1) reads data from a
Thrift server (I created an RDD; it computes the partitions in the
getPartitions() call, and in compute(), hasNext returns records from these
partitions). count() and foreach() are working fine; they return the correct
number of
There are a number of ways to bulk load.
There is bulk put, partitioned bulk put, MR bulk load, and now HBASE-14150,
which is Spark shuffle bulk load.
Let me know if I have missed a bulk loading option. All these are possible
with the new hbase-spark module.
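Whatever the mechanism, a bulk load has to deliver rows grouped by the region that owns their key and sorted within each group before HFiles can be written. A plain-Python sketch of that invariant (the region split points here are made up):

```python
# Sketch of the shuffle step of a bulk load: bucket rows by region,
# then sort each bucket by row key.
import bisect

REGION_SPLITS = ["g", "p"]  # regions: [-inf,"g"), ["g","p"), ["p",+inf)

def region_for(row_key: str) -> int:
    # Index of the region whose key range contains row_key.
    return bisect.bisect_right(REGION_SPLITS, row_key)

def partition_and_sort(rows):
    buckets = {}
    for key, value in rows:
        buckets.setdefault(region_for(key), []).append((key, value))
    return {r: sorted(kvs) for r, kvs in buckets.items()}

out = partition_and_sort([("q", 1), ("a", 2), ("h", 3), ("b", 4)])
print(out[0])  # [('a', 2), ('b', 4)]  -- first region, sorted
```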
As for the filter push down discussion in
Hi,
Thanks a lot.
The problem is not doing a non-equal join on large tables; in fact, one table
is really small and the other one is huge.
The problem is that Spark can only get the correct size for a DataFrame
created directly from a Hive table. Even if we create a DataFrame from a local
collection, it uses
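The behavior being complained about can be sketched as a simple rule (illustrative only; the 10 MB figure matches the documented default of spark.sql.autoBroadcastJoinThreshold, and the "unknown means huge" fallback is the point of the complaint):

```python
# Sketch of a size-based join-strategy choice: broadcast the smaller side
# only when its *estimated* size is under the threshold. Missing statistics
# fall back to a worst-case estimate, which defeats broadcasting.
DEFAULT_THRESHOLD = 10 * 1024 * 1024  # 10 MB default threshold
UNKNOWN_SIZE = 2**63 - 1              # "assume the worst" fallback estimate

def choose_strategy(est_size_bytes, threshold=DEFAULT_THRESHOLD):
    return "broadcast" if est_size_bytes <= threshold else "shuffle/nested-loop"

print(choose_strategy(1 * 1024 * 1024))  # broadcast
print(choose_strategy(UNKNOWN_SIZE))     # shuffle/nested-loop
```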
We are using MR-based bulk loading on Spark.
For filter pushdown, Astro does partition pruning, scan-range pruning, and uses
Gets as much as possible.
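The "use Gets as much as possible" idea can be sketched as follows (plain Python, not Astro's code): an equality on the row key becomes a point Get, while a bound only narrows a Scan range.

```python
# Sketch of turning row-key predicates into HBase-style operations.
def plan_access(predicate):
    op, key = predicate  # e.g. ("=", "row17") or (">=", "row10")
    if op == "=":
        return ("GET", key)            # point lookup
    if op == ">=":
        return ("SCAN", key, None)     # range [key, +inf)
    if op == "<":
        return ("SCAN", None, key)     # range [-inf, key)
    return ("SCAN", None, None)        # full-scan fallback

print(plan_access(("=", "row17")))   # ('GET', 'row17')
print(plan_access((">=", "row10")))  # ('SCAN', 'row10', None)
```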
Thanks,
From: Ted Malaska [mailto:ted.mala...@cloudera.com]
Sent: August 12, 2015, 9:14
To: Yan Zhou.sc
Cc: dev@spark.apache.org; Bing Xiao (Bing);
Hi Josh,
I mean on the driver side. OutputCommitCoordinator.startStage is called in
DAGScheduler#submitMissingTasks for all stages (costing some memory).
Although, as long as the executor side doesn't make the RPC call, there's
not much performance penalty.
On Wed, Aug 12, 2015 at 12:17 AM, Josh
Ted,
Thanks for pointing out more details of HBASE-14181. I am afraid I may still
need to learn more before I can make very accurate and pointed comments.
As for filter pushdown, Astro has a powerful approach to basically break down
arbitrarily complex logic expressions composed of
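One standard way to "break down" an arbitrary boolean predicate for pushdown is a disjunctive-normal-form expansion, so that each OR branch can be planned as its own scan range. This is a generic sketch, not necessarily what Astro implements:

```python
# Convert a boolean expression tree to DNF: a list of conjunctions.
def to_dnf(expr):
    """expr: ("and", l, r) | ("or", l, r) | ("pred", name).
    Returns a list of conjunctions, each a list of predicate names."""
    kind = expr[0]
    if kind == "pred":
        return [[expr[1]]]
    left, right = to_dnf(expr[1]), to_dnf(expr[2])
    if kind == "or":
        return left + right
    # "and": distribute over the OR branches of both sides
    return [l + r for l in left for r in right]

# (a OR b) AND c  ->  (a AND c) OR (b AND c)
expr = ("and", ("or", ("pred", "a"), ("pred", "b")), ("pred", "c"))
print(to_dnf(expr))  # [['a', 'c'], ['b', 'c']]
```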
No, Astro's bulk loader does not use its own shuffle. But its map/reduce-side
processing is somewhat different from HBase’s bulk loader that is used by many
HBase apps, I believe.
From: Ted Malaska [mailto:ted.mala...@cloudera.com]
Sent: Wednesday, August 12, 2015 8:56 AM
To: Yan Zhou.sc
Cc:
Thanks for the pointers. Yes, I started with changing the hive.group
property in the POM and started seeing various dependency issues.
Initially I thought spark-project.hive was just a POM for uber jars that
pull in Hive classes without transitive dependencies like Kryo, but it looks
like a lot more
Firstly, spark.sql.autoBroadcastJoinThreshold only works for equi-joins.
Currently, for a non-equal join, if the join type is an INNER join, it
will be done by a CartesianProduct join, and BroadcastNestedLoopJoin handles the
outer joins.
In the BroadcastNestedLoopJoin, the table
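What a broadcast nested-loop join actually does can be shown with a plain-Python sketch (illustrative, not Spark's implementation): the small side is shipped to every task, and each streamed row is tested against every broadcast row with an arbitrary predicate, which is why it handles non-equality conditions.

```python
# Sketch of a broadcast nested-loop join.
def broadcast_nested_loop_join(streamed, broadcast, predicate):
    out = []
    for row in streamed:          # one pass over the big (streamed) side
        for b in broadcast:       # full loop over the broadcast side
            if predicate(row, b): # arbitrary, possibly non-equality, condition
                out.append((row, b))
    return out

big = [(1, "x"), (5, "y"), (9, "z")]
small = [(3,), (7,)]
# non-equality condition: big.key > small.key
result = broadcast_nested_loop_join(big, small, lambda r, b: r[0] > b[0])
print(result)  # [((5, 'y'), (3,)), ((9, 'z'), (3,)), ((9, 'z'), (7,))]
```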
Hey Yan,
I've been the one building out this Spark functionality in HBase, so maybe I
can help clarify.
The hbase-spark module is just focused on making Spark integration with
HBase easy and out of the box, for both Spark and Spark Streaming.
I, and I believe the HBase team, have no desire to build
We have not “formally” published any numbers yet. A good reference is a slide
deck we posted for the meetup in March, or better yet, interested parties can
run performance comparisons themselves for now.
As for the status quo of Astro, we have been focusing on fixing bugs (a UDF-related
bug