Re: PySpark on PyPI

2015-08-11 Thread westurner
Matt Goodman wrote: I would tentatively suggest also conda packaging. http://conda.pydata.org/docs/ $ conda skeleton pypi pyspark # update git_tag and git_uri # add test commands (import pyspark; import pyspark.[...]) Docs for building conda packages for multiple operating systems and
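
A minimal sketch of the kind of test commands such a recipe might run (the script name and local[2] master are illustrative; this assumes the built package puts a working Spark on the path):

    # smoke_test.py - a hypothetical conda recipe test script
    import pyspark          # the imports the recipe's test section would list
    import pyspark.sql

    from pyspark import SparkContext

    # A tiny local job proves the JVM/py4j bridge works, not just the imports.
    sc = SparkContext("local[2]", "conda-smoke-test")
    assert sc.parallelize(range(100)).sum() == 4950
    sc.stop()
    print("pyspark conda package looks usable")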

RE: Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Ted Malaska
The bulk load code is HBASE-14150, if you are interested. Let me know how it can be made faster. It's just a Spark shuffle and writing HFiles. Unless Astro wrote its own shuffle, the times should be very close. On Aug 11, 2015 8:49 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote: Ted, Thanks for
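
The actual HBASE-14150 code lives in the (Scala) hbase-spark module; the following is only a toy PySpark sketch of the "one shuffle, then write sorted files" idea, with a made-up partitioner standing in for HBase's region boundaries:

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "bulkload-sketch")
    kvs = sc.parallelize([("row3", "v3"), ("row1", "v1"), ("row2", "v2")])

    NUM_REGIONS = 2  # stand-in for the target table's region count

    def region_for(rowkey):
        # The real code maps a rowkey to its region via the region start keys.
        return hash(rowkey) % NUM_REGIONS

    # One shuffle co-locates each region's rows and sorts them by rowkey,
    # which is the ordering HFiles require.
    sorted_per_region = kvs.repartitionAndSortWithinPartitions(
        numPartitions=NUM_REGIONS, partitionFunc=region_for)

    # The real code writes HFiles here (HFileOutputFormat2); plain text
    # files are enough to show the shape of the output.
    sorted_per_region.saveAsTextFile("/tmp/bulkload-sketch")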

Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Yan Zhou.sc
Finally I can take a look at HBASE-14181 now. Unfortunately there is no design doc mentioned. Superficially it is very similar to Astro, with the difference that this is part of the HBase client library, while Astro works as a Spark package and so will evolve and function more closely with Spark

Re: Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Ted Yu
Yan: Where can I find performance numbers for Astro (it's close to the middle of August)? Cheers On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote: Finally I can take a look at HBASE-14181 now. Unfortunately there is no design doc mentioned. Superficially it is very

Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Yan Zhou.sc
Ted, I’m in China now, and seem to have difficulty accessing Apache Jira. Anyways, it appears to me that HBASE-14181 (https://issues.apache.org/jira/browse/HBASE-14181) attempts to support Spark DataFrame inside HBase. If true, one question to me is whether HBase is intended to have a

Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Yan Zhou.sc
Ok. Then a question will be how to define the boundary between a query engine and built-in processing. If, for instance, the Spark DataFrame functionalities involving shuffling are to be supported inside HBase, in my opinion, it’d be hard not to tag it as a query engine. If, on the other hand,

Re: [discuss] Removing individual commit messages from the squash commit message

2015-08-11 Thread Reynold Xin
This is now done with this pull request: https://github.com/apache/spark/pull/8091 Committers, please update the script to get this feature. On Mon, Jul 20, 2015 at 12:28 AM, Manoj Kumar manojkumarsivaraj...@gmail.com wrote: +1 Sounds like a great idea. On Sun, Jul 19, 2015 at 10:54 PM,

Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Ted Yu
HBase will not have a query engine. It will provide better support to query engines. Cheers On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote: Ted, I’m in China now, and seem to have difficulty accessing Apache Jira. Anyways, it appears to me that

Re: Pushing Spark to 10Gb/s

2015-08-11 Thread Akhil Das
Hi Starch, It also depends on the application's behavior; some applications may not be able to utilize the network properly. If you are using, say, Kafka, then one thing to keep in mind is the size of the individual messages and the number of partitions you have. The higher the message
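
A sketch of the usual lever (Spark 1.4-era streaming API; topic name, hosts, and partition counts are made up): if one receiver cannot keep a 10Gb link busy, raise the downstream parallelism after ingestion.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="kafka-throughput-sketch")
    ssc = StreamingContext(sc, batchDuration=5)

    # The dict maps topic -> number of consumer threads for this receiver.
    stream = KafkaUtils.createStream(ssc, "zk-host:2181", "throughput-test", {"events": 4})

    # Spread the received blocks across more cores before any heavy work.
    stream.repartition(32).count().pprint()

    ssc.start()
    ssc.awaitTermination()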

Re: Inquiry about contributing code

2015-08-11 Thread Akhil Das
You can create a new issue and send a pull request for it, I think. + dev list Thanks Best Regards On Tue, Aug 11, 2015 at 8:32 AM, Hyukjin Kwon gurwls...@gmail.com wrote: Dear Sir / Madam, I have a plan to contribute some code for passing filters to a datasource as physical

Is OutputCommitCoordinator necessary for all the stages?

2015-08-11 Thread Jeff Zhang
As I understand it, OutputCommitCoordinator should only be necessary for ResultStage (especially a ResultStage with an HDFS write), but currently it is used for all the stages. Is there any reason for that? -- Best Regards Jeff Zhang
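
For readers following along, this is not Spark's code, just a toy illustration of the arbitration being discussed: once a stage writes output, speculative attempts of the same task can race to commit and exactly one must win; a stage that commits nothing arguably never needs this.

    class ToyCommitCoordinator(object):
        def __init__(self):
            self.winner = {}  # partition -> attempt id allowed to commit

        def can_commit(self, partition, attempt):
            # First attempt to ask gets the commit right; later ones are denied.
            if partition not in self.winner:
                self.winner[partition] = attempt
            return self.winner[partition] == attempt

    coord = ToyCommitCoordinator()
    print(coord.can_commit(partition=0, attempt="a"))  # True
    print(coord.can_commit(partition=0, attempt="b"))  # False: "a" already won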

Potential bug in broadcastNestedLoopJoin or the default value of spark.sql.autoBroadcastJoinThreshold

2015-08-11 Thread gen tang
Hi, Recently, I used Spark SQL to do a join on a non-equality condition, for example condition1 or condition2. Spark will use broadcastNestedLoopJoin to do this. Assume that one of the dataframes (df1) is created neither from a Hive table nor from a local collection, and the other one (df2) is created from a Hive table. For
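
A sketch of the setup being described (Spark 1.4-era API; the Hive table and its column names are made up). df2 comes from a Hive table, so the planner knows its size; df1 does not:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext("local[2]", "nonequi-join-sketch")
    sqlContext = HiveContext(sc)

    # No Hive statistics behind this one.
    df1 = sqlContext.createDataFrame(sc.parallelize([(1, 10), (2, 20)]), ["a", "b"])
    # Backed by Hive, so its size estimate is accurate.
    df2 = sqlContext.table("some_hive_table")  # assumed to have columns x and y

    # Non-equality condition: no equi-join keys, so the planner falls back to
    # BroadcastNestedLoopJoin (outer joins) or a cartesian product (inner).
    joined = df1.join(df2, (df1.a < df2.x) | (df1.b > df2.y), "left_outer")
    joined.explain()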

Spark runs into an infinite loop even if the tasks are completed successfully

2015-08-11 Thread Akhil Das
Hi, My Spark job (running in local[*] with Spark 1.4.1) reads data from a Thrift server (I created an RDD; it computes the partitions in the getPartitions() call, and in compute() hasNext returns records from these partitions). count() and foreach() are working fine and return the correct number of

RE: Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Ted Malaska
There are a number of ways to bulk load: bulk put, partition bulk put, MR bulk load, and now HBASE-14150, which is Spark shuffle bulk load. Let me know if I have missed a bulk loading option. All of these are possible with the new hbase-spark module. As for the filter push-down discussion in

Re: Potential bug in broadcastNestedLoopJoin or the default value of spark.sql.autoBroadcastJoinThreshold

2015-08-11 Thread gen tang
Hi, Thanks a lot. The problem is not doing a non-equal join on large tables; in fact, one table is really small and the other one is huge. The problem is that Spark can only get the correct size for a dataframe created directly from a Hive table. Even if we create a dataframe from a local collection, it uses

Re: Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Yan Zhou.sc
We are using MR-based bulk loading on Spark. For filter pushdown, Astro does partition pruning and scan-range pruning, and uses Gets as much as possible. Thanks, From: Ted Malaska [mailto:ted.mala...@cloudera.com] Sent: August 12, 2015 9:14 To: Yan Zhou.sc Cc: dev@spark.apache.org; Bing Xiao (Bing);

Re: Is OutputCommitCoordinator necessary for all the stages?

2015-08-11 Thread Jeff Zhang
Hi Josh, I mean on the driver side. OutputCommitCoordinator.startStage is called in DAGScheduler#submitMissingTasks for all the stages (which costs some memory). Although, as long as the executor side doesn't make RPC calls, there isn't much of a performance penalty. On Wed, Aug 12, 2015 at 12:17 AM, Josh

RE: Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Yan Zhou.sc
Ted, Thanks for pointing out more details of HBASE-14181. I am afraid I may still need to learn more before I can make very accurate and pointed comments. As for filter pushdown, Astro has a powerful approach that basically breaks down arbitrarily complex logic expressions composed of
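
As a toy illustration of the general idea (not Astro's code): split a conjunctive predicate into the parts a source like HBase can evaluate during the scan and a residual that Spark evaluates afterwards.

    def split_conjuncts(conjuncts, pushable):
        # Pushed conjuncts go to the source; the residual stays in Spark.
        pushed = [c for c in conjuncts if pushable(c)]
        residual = [c for c in conjuncts if not pushable(c)]
        return pushed, residual

    # Pretend only simple rowkey comparisons can be pushed down.
    conjuncts = ["rowkey >= 'a'", "rowkey < 'k'", "udf(col1) = 3"]
    pushed, residual = split_conjuncts(conjuncts, lambda c: c.startswith("rowkey"))
    print(pushed)    # usable for scan-range pruning on the rowkey
    print(residual)  # evaluated by Spark after the scan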

RE: Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Yan Zhou.sc
No, the Astro bulk loader does not use its own shuffle. But its map/reduce-side processing is somewhat different from HBase’s bulk loader that is used by many HBase apps, I believe. From: Ted Malaska [mailto:ted.mala...@cloudera.com] Sent: Wednesday, August 12, 2015 8:56 AM To: Yan Zhou.sc Cc:

Re: Sources/pom for org.spark-project.hive

2015-08-11 Thread Pala M Muthaia
Thanks for the pointers. Yes, I started by changing the hive.group property in the pom and started seeing various dependency issues. Initially I thought spark-project.hive was just a pom for uber jars that pull in Hive classes without transitive dependencies like kryo, but it looks like a lot more

RE: Potential bug in broadcastNestedLoopJoin or the default value of spark.sql.autoBroadcastJoinThreshold

2015-08-11 Thread Cheng, Hao
Firstly, spark.sql.autoBroadcastJoinThreshold only works for EQUAL JOINs. Currently, for a non-equal join, if the join type is INNER it will be done by a CartesianProduct join, and BroadcastNestedLoopJoin handles the outer joins. In BroadcastNestedLoopJoin, the table
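
For reference, the threshold itself can be inspected or changed like this (assuming a SQLContext/HiveContext named sqlContext, as in the earlier join sketch; the key is the real property name, the values are illustrative):

    # The era's default is 10MB; -1 disables size-based auto broadcast.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))
    sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=-1").collect()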

RE: Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Ted Malaska
Hey Yan, I've been the one building out this Spark functionality in HBase, so maybe I can help clarify. The hbase-spark module is just focused on making Spark integration with HBase easy and out of the box for both Spark and Spark Streaming. I, and I believe the HBase team, have no desire to build

RE: Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Yan Zhou.sc
We have not “formally” published any numbers yet. A good reference is the slide deck we posted for the meetup in March; or better yet, interested parties can run performance comparisons themselves for now. As for the status quo of Astro, we have been focusing on fixing bugs (UDF-related bug