How did this unit test pass on master trunk?

2016-04-22 Thread Yong Zhang
Hi, I was trying to find out why this unit test can pass in the Spark code at https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala, for this unit test: test("Star Expansion - CreateStruct and CreateArray") { val structDf =
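A minimal spark-shell sketch (not the test body itself) of what "star expansion" inside CreateStruct/CreateArray means; it assumes a sufficiently recent shell where sqlContext is predefined, and the column names are made up:

    import sqlContext.implicits._
    val df = Seq((1, 10), (2, 20)).toDF("a", "b")
    // the * inside struct(...) expands to all columns of df (CreateStruct)
    val structDf = df.selectExpr("struct(*) AS record")
    structDf.select($"record.a", $"record.b").show()
    // CreateArray built from named columns of the same type
    df.selectExpr("array(a, b) AS arr").show()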

Spark Standalone with SPARK_CLASSPATH in spark-env.sh and "spark.driver.userClassPathFirst"

2016-04-15 Thread Yong Zhang
Hi, I found a problem with using "spark.driver.userClassPathFirst" and SPARK_CLASSPATH in spark-env.sh in a Standalone environment, and want to confirm that it in fact has no good solution. We are running Spark 1.5.2 in standalone mode on a cluster. Since the cluster doesn't have the direct
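For context, a hedged sketch of the conf-key equivalents of the deprecated SPARK_CLASSPATH; the paths are hypothetical, and driver-side entries normally have to go into spark-defaults.conf or onto the spark-submit command line, since the driver JVM is already running by the time application code executes:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.driver.extraClassPath", "/opt/myapp/lib/*")    // hypothetical path; driver-side replacement for SPARK_CLASSPATH
      .set("spark.executor.extraClassPath", "/opt/myapp/lib/*")  // same jars on the executors
      .set("spark.driver.userClassPathFirst", "true")            // experimental: prefer user jars over Spark's own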

RE: how does sc.textFile translate regex in the input.

2016-04-13 Thread Yong Zhang
It is described in "Hadoop: The Definitive Guide", chapter 3, "File Patterns": https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch03.html#FilePatterns Yong From: pradeep1...@gmail.com Date: Wed, 13 Apr 2016 18:56:58 + Subject: how does sc.textFile translate regex in
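The patterns in question are Hadoop file globs, which sc.textFile hands straight to the underlying FileInputFormat. A small sketch with a hypothetical path:

    // supported glob characters include *  ?  [abc]  [a-b]  [^a]  {a,b}
    val logs = sc.textFile("hdfs:///logs/2016-0[1-3]/part-*")   // hypothetical location
    println(logs.count())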

RE: Logging in executors

2016-04-13 Thread Yong Zhang
Is the env/dev/log4j-executor.properties file inside your jar file? Does the path match what you specified, env/dev/log4j-executor.properties? If you read the log4j documentation here: https://logging.apache.org/log4j/1.2/manual.html When you specify the
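The point of the questions above is that, without a URL scheme, log4j resolves the value as a classpath resource, so the file must sit at exactly that path inside the application jar (or otherwise be on the executor classpath). A hedged sketch reusing the file name from the thread:

    // "env/dev/log4j-executor.properties" is looked up on the executor classpath,
    // e.g. under that exact path inside the application jar
    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-Dlog4j.configuration=env/dev/log4j-executor.properties")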

RE: Sqoop on Spark

2016-04-06 Thread Yong Zhang
pr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com> wrote: If they do that, they must provide a customized input format, instead of through JDBC. Yong Date: Wed, 6 Apr 2016 23:56:54 +0100 Subject: Re: Sqoop on Spark From: mich.talebza...@gmail.com To: mohaj...@gmail.com CC: jorn

RE: Sqoop on Spark

2016-04-06 Thread Yong Zhang
If they do that, they must provide a customized input format, instead of through JDBC. Yong Date: Wed, 6 Apr 2016 23:56:54 +0100 Subject: Re: Sqoop on Spark From: mich.talebza...@gmail.com To: mohaj...@gmail.com CC: jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com;
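A sketch of the JDBC route being contrasted here with a custom input format; the connection details are made up for illustration:

    // parallel JDBC read through the DataFrame API
    val ordersDf = sqlContext.read.format("jdbc").options(Map(
      "url"             -> "jdbc:mysql://db-host:3306/shop",   // hypothetical database
      "dbtable"         -> "orders",
      "partitionColumn" -> "order_id",                          // numeric column used to split the read
      "lowerBound"      -> "1",
      "upperBound"      -> "1000000",
      "numPartitions"   -> "8"
    )).load()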

RE: ordering over structs

2016-04-06 Thread Yong Zhang
1) Is a struct in Spark like a struct in C++? Kind of. It's an ordered collection of data with known names/types. 2) What is an alias in this context? It assigns a name to the column, similar to AS in SQL. 3) How does this code even work? Ordering
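A small sketch of the behaviour described: structs compare field by field, left to right, and .alias only names the column. The data is made up:

    import org.apache.spark.sql.functions.struct
    import sqlContext.implicits._
    val df = Seq((2, "a"), (1, "b"), (1, "a")).toDF("x", "y")
    // ordering by the struct sorts by x first, then by y
    df.select(struct($"x", $"y").alias("s"), $"x", $"y").orderBy($"s").show()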

RE: Plan issue with spark 1.5.2

2016-04-06 Thread Yong Zhang
that each partition from one DF joins with the partition with the same key of DF 2 on the worker node, without shuffling the data. In other words, do as much work as possible within the worker node before shuffling the data. Thanks Darshan Singh On Wed, Apr 6, 2016 at 10:06 PM, Yong Zhang <java8...@hotmail.com>

RE: Plan issue with spark 1.5.2

2016-04-06 Thread Yong Zhang
Thanks On Wed, Apr 6, 2016 at 8:37 PM, Yong Zhang <java8...@hotmail.com> wrote: What you are looking for is https://issues.apache.org/jira/browse/SPARK-4849 This feature is available in Spark 1.6.0, so the DataFrame can reuse the partitioned data in the join. For your case in 1.5.x, you have
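A hedged sketch of the 1.6+ pattern SPARK-4849 enables: repartition both sides by the join key so the join can reuse that partitioning. df1, df2 and key are placeholders, and whether the extra Exchange really disappears depends on both sides ending up with the same number of partitions:

    // Spark 1.6.0 or later only: repartition by expression
    val left   = df1.repartition($"key")
    val right  = df2.repartition($"key")
    val joined = left.join(right, "key")
    joined.explain(true)   // check the physical plan for Exchange nodes above the join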

RE: Plan issue with spark 1.5.2

2016-04-06 Thread Yong Zhang
me know if you need further information. On Tue, Apr 5, 2016 at 6:33 PM, Yong Zhang <java8...@hotmail.com> wrote: You need to show us the execution plan, so we can understand what your issue is. Use spark-shell code to show how your DF is built, how you partition them, then use expl

RE: Partition pruning in spark 1.5.2

2016-04-05 Thread Yong Zhang
Hi, Michael: I would like to ask the same question: if the DF is hash partitioned and then cached, and we now query/filter by the column that was hashed for partitioning, will Spark be smart enough to do the partition pruning in this case, instead of depending on Parquet's partition pruning? I think that is the

RE: Plan issue with spark 1.5.2

2016-04-05 Thread Yong Zhang
You need to show us the execution plan, so we can understand what your issue is. Use spark-shell code to show how your DF is built and how you partition them, then use explain(true) on your join DF and show the output here, so we can better help you. Yong > Date: Tue, 5 Apr 2016 09:46:59

RE: spark.driver.memory meaning

2016-04-03 Thread Yong Zhang
In standalone mode, it applies to the driver JVM process heap size. You should consider giving it enough memory in standalone mode, because: 1) Any data you bring back to the driver is stored in it, e.g. RDD.collect or DF.show. 2) The driver also hosts a web UI for the application
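The setting has to be known before the driver JVM starts, so it is normally passed to spark-submit (e.g. --driver-memory 4g) or put in spark-defaults.conf rather than set in application code. A sketch of the kind of calls whose results live in that heap; df is a placeholder:

    // everything returned to the driver is held in the spark.driver.memory heap
    val sample = df.limit(1000).collect()   // rows materialized on the driver
    df.show()                               // also routed through the driver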

RE: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Yong Zhang
Is there a configuration in Spark for the location of "shuffle spilling"? I don't recall ever seeing one. Can you share it? It would be good for a test of writing to a RAM disk, if that configuration is available. Thanks Yong From: r...@databricks.com Date: Fri, 1 Apr 2016 15:32:23 -0700
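For reference, the scratch space used for shuffle files and spills is spark.local.dir (or SPARK_LOCAL_DIRS in a standalone worker's spark-env.sh, which takes precedence). A hedged sketch pointing it at a hypothetical tmpfs mount for the RAM-disk test discussed here:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.local.dir", "/mnt/ramdisk/spark")   // hypothetical RAM-disk mount point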

RE: SPARK-13900 - Join with simple OR conditions take too long

2016-03-31 Thread Yong Zhang
I agree that there won't be a generic solution for these kinds of cases. Without a CBO from Spark or the Hadoop ecosystem in the near future, maybe Spark DataFrame/SQL should support more hints from the end user, as in these cases end users will be smart enough to tell the engine what is the correct
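A sketch of the kind of hint being discussed; bigDf and smallDf are placeholders. A join on OR-ed conditions cannot run as an equi-join, but marking the small side as broadcastable should, in principle, let it run as a broadcast nested-loop join rather than a shuffled cartesian product:

    import org.apache.spark.sql.functions.broadcast
    val joined = bigDf.join(broadcast(smallDf),
      bigDf("k1") === smallDf("k1") || bigDf("k2") === smallDf("k2"))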

Spark 1.5.2 Master OOM

2016-03-30 Thread Yong Zhang
Hi, Sparkers Our cluster is running Spark 1.5.2 in Standalone mode. It ran fine for weeks, but today I found that the master crashed due to an OOM. We have several ETL jobs that run daily on Spark, plus ad-hoc jobs. I can see the "Completed Applications" table grow in the master UI. Originally I set

RE: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-24 Thread Yong Zhang
.com > CC: user@spark.apache.org > > On Wed, Mar 23, 2016 at 10:35 AM, Yong Zhang <java8...@hotmail.com> wrote: > > Here is the output: > > > > == Parsed Logical Plan == > > Project [400+ columns] > > +- Project [400+ columns] > > +- Project [400+ columns] >

RE: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Yong Zhang
> From: dav...@databricks.com > To: java8...@hotmail.com > CC: user@spark.apache.org > > On Wed, Mar 23, 2016 at 10:35 AM, Yong Zhang <java8...@hotmail.com> wrote: > > Here is the output: > > > > == Parsed Logical Plan == > > Project [400+ columns] > >

RE: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Yong Zhang
> The broadcast hint does not work as expected in this case, could you > also show the logical plan by 'explain(true)'? > > On Wed, Mar 23, 2016 at 8:39 AM, Yong Zhang <java8...@hotmail.com> wrote: > > > > So I am testing this code to understand "broadcast"
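The request in this reply is simply to confirm, from the plan, whether the broadcast actually happened. A sketch with placeholder DataFrames:

    import org.apache.spark.sql.functions.broadcast
    val joined = factDf.join(broadcast(dimDf), "key")
    joined.explain(true)
    // a BroadcastHashJoin node in the physical plan means the hint took effect;
    // a SortMergeJoin fed by Exchange nodes means the big shuffle is still there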

Does Spark 1.5.x really still support Hive 0.12?

2016-03-04 Thread Yong Zhang
When I tried to compile Spark 1.5.2 with -Phive-0.12.0, Maven gave me an error that the profile doesn't exist any more. But when I read the Spark SQL programming guide here: http://spark.apache.org/docs/1.5.2/sql-programming-guide.html It keeps mentioning that Spark 1.5.2 can still work with
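Assuming the post-1.4 behaviour, the Hive metastore version is chosen by configuration rather than by a build profile; a hedged sketch:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.sql.hive.metastore.version", "0.12.0")
      .set("spark.sql.hive.metastore.jars", "maven")   // or a classpath that contains the Hive 0.12 jars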

Spark 1.5.2, DataFrame broadcast join, OOM

2016-02-23 Thread Yong Zhang
Hi, I am testing Spark 1.5.2 on a 4-node cluster with 64G of memory each; one is the master and 3 are workers. I am using Standalone mode. Here are my spark-env.sh settings: export SPARK_LOCAL_DIRS=/data1/spark/local,/data2/spark/local,/data3/spark/local export
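One knob usually involved in DataFrame broadcast-join OOMs is the automatic broadcast threshold (in bytes); a hedged sketch of lowering it from inside the application:

    // smaller threshold, or -1 to disable auto-broadcast and use broadcast() explicitly
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)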
