Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
> But it's really weird to be setting SPARK_HOME in the environment of > your node managers. YARN shouldn't need to know about that. > On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang > wrote: > > > > > https://github.com/apache/spark/blob/88e7e87bd5c052e10f52d4bb97a9d78f5b524128/c

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
does not get > expanded by the shell). > > But it's really weird to be setting SPARK_HOME in the environment of > your node managers. YARN shouldn't need to know about that. > On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang > wrote: > > > > > https://github.com/apache/spark

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
setting SPARK_HOME in the environment of >> your node managers. YARN shouldn't need to know about that. >> On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang >> wrote: >> > >> > >> https://github.com/apache/spark/blob/88e7e87bd5c052e10f52d4bb97a9d78f5b524128

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
from your gateway machine to YARN by > default. > > You probably have some configuration (in spark-defaults.conf) that > tells YARN to use a cached copy. Get rid of that configuration, and > you can use whatever version you like. > On Thu, Oct 4, 2018 at 2:19 AM Jianshi Huang > wrote:

Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
er-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip'] > sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client", > conf=sparkConf, pyFiles=py_files) > > Thanks, -- Jianshi Huang

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
No one using History server? :) Am I the only one who needs to see all users' logs? Jianshi On Thu, May 21, 2015 at 1:29 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I'm using Spark 1.4.0-rc1 and I'm using default settings for history server. But I can only see my own logs

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
directory. On Wed, May 27, 2015 at 5:33 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: No one using History server? :) Am I the only one who needs to see all users' logs? Jianshi On Thu, May 21, 2015 at 1:29 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I'm using Spark 1.4.0-rc1

View all user's application logs in history server

2015-05-20 Thread Jianshi Huang
Hi, I'm using Spark 1.4.0-rc1 and I'm using default settings for the history server. But I can only see my own logs. Is it possible to view all users' logs? The permission is fine for the user group. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Why so slow

2015-05-12 Thread Jianshi Huang
)) PhysicalRDD [meta#143,nvar#145,date#147], MapPartitionsRDD[6] at explain at console:32 Jianshi On Tue, May 12, 2015 at 10:34 PM, Olivier Girardot ssab...@gmail.com wrote: can you post the explain too? On Tue, May 12, 2015 at 12:11 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I

Why so slow

2015-05-12 Thread Jianshi Huang
is still open, when can we have it fixed? :) -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-06 Thread Jianshi Huang
I'm using the default settings. Jianshi On Wed, May 6, 2015 at 7:05 PM, twinkle sachdeva twinkle.sachd...@gmail.com wrote: Hi, Can you please share your compression etc settings, which you are using. Thanks, Twinkle On Wed, May 6, 2015 at 4:15 PM, Jianshi Huang jianshi.hu...@gmail.com
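For reference, these are the knobs the reply is asking about. A minimal sketch with what I understand to be the usual Spark 1.x defaults; the values are illustrative, and switching the codec is only a way to rule the codec out, not a confirmed fix.

```scala
import org.apache.spark.SparkConf

// Shuffle and compression settings relevant to FAILED_TO_UNCOMPRESS reports.
// Values shown are the usual 1.x defaults; try "lz4" or "lzf" only to rule the codec out.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "sort")          // sort-based shuffle, as in the subject
  .set("spark.shuffle.compress", "true")         // compress shuffle outputs
  .set("spark.io.compression.codec", "snappy")   // snappy | lz4 | lzf
```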

Re: Parquet error reading data that contains array of structs

2015-04-27 Thread Jianshi Huang
, Apr 24, 2015 at 11:00 AM, Yin Huai yh...@databricks.com wrote: The exception looks like the one mentioned in https://issues.apache.org/jira/browse/SPARK-4520. What is the version of Spark? On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, My data looks

Re: Parquet error reading data that contains array of structs

2015-04-26 Thread Jianshi Huang
Huai yh...@databricks.com wrote: The exception looks like the one mentioned in https://issues.apache.org/jira/browse/SPARK-4520. What is the version of Spark? On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, My data looks like

Parquet error reading data that contains array of structs

2015-04-24 Thread Jianshi Huang
(MessageColumnIO.java:96) at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193) -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http

How to write Hive's map(key, value, ...) in Spark SQL DSL

2015-04-22 Thread Jianshi Huang
Hi, I want to write this in Spark SQL DSL: select map('c1', c1, 'c2', c2) as m from table Is there a way? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
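For context, a minimal sketch of how this can be written with the DataFrame DSL in later Spark releases (2.x), where org.apache.spark.sql.functions.map is available; with a 1.x HiveContext the plain SQL form quoted above remains the straightforward route. The DataFrame `df` with columns c1 and c2 is an assumption.

```scala
import org.apache.spark.sql.functions.{lit, map}

// Equivalent of: select map('c1', c1, 'c2', c2) as m from table
// (assumes a DataFrame `df` with columns c1 and c2; Spark 2.x functions.map)
val result = df.select(map(lit("c1"), df("c1"), lit("c2"), df("c2")).as("m"))

// Or via an expression string, which reuses the SQL function directly:
val result2 = df.selectExpr("map('c1', c1, 'c2', c2) as m")
```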

Re: How to do dispatching in Streaming?

2015-04-17 Thread Jianshi Huang
, friction can be a huge factor in the equations in some other it is just part of the landscape *From:* Gerard Maas [mailto:gerard.m...@gmail.com] *Sent:* Friday, April 17, 2015 10:12 AM *To:* Evo Eftimov *Cc:* Tathagata Das; Jianshi Huang; user; Shao, Saisai; Huang Jie *Subject:* Re: How

How to do dispatching in Streaming?

2015-04-12 Thread Jianshi Huang
- multiple DStreams) Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Add partition support in saveAsParquet

2015-03-27 Thread Jianshi Huang
Hi, Does anyone have a similar request? https://issues.apache.org/jira/browse/SPARK-6561 When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: def saveAsParquet(path: String, partitionColumns: Seq[String]) -- Jianshi Huang LinkedIn
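For readers landing here later: partitioned Parquet writes did get an API, though under a different name than the one proposed in SPARK-6561. A sketch against the DataFrameWriter interface introduced in Spark 1.4; the column name and path are placeholders.

```scala
// Write a DataFrame as Parquet, partitioned by a column, producing
// .../date=2015-03-27/part-*.parquet style directories (Spark 1.4+).
df.write
  .partitionBy("date")
  .parquet("hdfs:///warehouse/events")
```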

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-6353 On Mon, Mar 16, 2015 at 5:36 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, We're facing No space left on device errors lately from time to time. The job will fail after retries. Obviously in such a case, retry won't

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
spark.scheduler.executorTaskBlacklistTime to 3 to solve such No space left on device errors. So if a task runs unsuccessfully in some executor, it won't be scheduled to the same executor in 30 seconds. Best Regards, Shixiong Zhu 2015-03-16 17:40 GMT+08:00 Jianshi Huang jianshi.hu...@gmail.com: I created a JIRA

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
Oh, by default it's set to 0L. I'll try setting it to 3 immediately. Thanks for the help! Jianshi On Mon, Mar 16, 2015 at 11:32 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Thanks Shixiong! Very strange that our tasks were retried on the same executor again and again. I'll check
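A sketch of the setting being discussed, assuming the intended value is 30 seconds (30000 ms) as the reply's "in 30 seconds" wording suggests. spark.scheduler.executorTaskBlacklistTime was an undocumented Spark 1.x property, so verify the name and units against your build.

```scala
import org.apache.spark.SparkConf

// Avoid rescheduling a failed task on the same executor for 30 seconds,
// giving "No space left on device" executors a chance to be skipped.
val conf = new SparkConf()
  .set("spark.scheduler.executorTaskBlacklistTime", "30000")  // milliseconds (assumed)
```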

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
Forget about my last message. I was confused. Spark 1.2.1 + Scala 2.10.4 started by SBT console command also failed with this error. However running from a standard spark shell works. Jianshi On Fri, Mar 13, 2015 at 2:46 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hmm... look like

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
Hmm... looks like the console command still starts Spark 1.3.0 with Scala 2.11.6 even though I changed them in build.sbt. So the test with 1.2.1 is not valid. Jianshi On Fri, Mar 13, 2015 at 2:34 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: I've confirmed it only failed in console started

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
[info] try { f } finally { println("Elapsed: " + (now - start)/1000.0 + " s") } [info] } [info] [info] @transient val sqlc = new org.apache.spark.sql.SQLContext(sc) [info] implicit def sqlContext = sqlc [info] import sqlc._ Jianshi On Fri, Mar 13, 2015 at 3:10 AM, Jianshi Huang jianshi.hu...@gmail.com

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
Liancheng also found out that the Spark jars are not included in the classpath of URLClassLoader. Hmm... we're very close to the truth now. Jianshi On Fri, Mar 13, 2015 at 6:03 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: I'm almost certain the problem is the ClassLoader. So adding

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
. Thanks Ashish -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
I'm almost certain the problem is the ClassLoader. So adding fork := true solves problems for test and run. The problem is how can I fork a JVM for sbt console? fork in console := true seems not working... Jianshi On Fri, Mar 13, 2015 at 4:35 PM, Jianshi Huang jianshi.hu...@gmail.com
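A sketch of the build.sbt settings being discussed (sbt 0.13-era syntax). Forking works for run and test; sbt's console task runs inside the sbt JVM, which is consistent with the observation above that fork in console has no effect.

```scala
// build.sbt -- fork a fresh JVM so Spark's classes are loaded by the
// application classloader instead of sbt's, avoiding the
// ScalaReflection/URLClassLoader mismatch for run and test.
fork := true
fork in Test := true
javaOptions ++= Seq("-Xmx4g")   // illustrative; size to your job

// Note: sbt's `console` task cannot be forked, so the REPL still shares
// sbt's classloader; a plain spark-shell is the workaround mentioned above.
```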

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
:23 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Same issue here. But the classloader in my exception is somehow different. scala.ScalaReflectionException: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with java.net.URLClassLoader@53298398 of type class

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
<spark.version>1.2.1</spark.version> <scala.version>2.11.5</scala.version> Please let me know how I can resolve this problem. Thanks Ashish -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
: We don't support expressions or wildcards in that configuration. For each application, the local directories need to be constant. If you have users submitting different Spark applications, those can each set spark.local.dirs. - Patrick On Wed, Mar 11, 2015 at 12:14 AM, Jianshi Huang

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
directories either. Typically, like in YARN, you would have a number of directories (on different disks) mounted and configured for local storage for jobs. On Wed, Mar 11, 2015 at 7:42 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Unfortunately /tmp mount is really small in our environment. I

How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
Hi, I need to set per-user spark.local.dir, how can I do that? I tried both /x/home/${user.name}/spark/tmp and /x/home/${USER}/spark/tmp And neither worked. Looks like it has to be a constant setting in spark-defaults.conf. Right? Any ideas how to do that? Thanks, -- Jianshi Huang
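As the replies point out, spark.local.dir cannot contain expressions, but each application may set its own value. A sketch of doing that per submission, resolving the user on the client side before Spark ever sees the string; the paths are placeholders.

```scala
import org.apache.spark.SparkConf

// Resolve the user at submit time and pass a constant value to Spark;
// the expansion happens in your code, not in spark-defaults.conf.
val user = sys.props("user.name")
val conf = new SparkConf()
  .set("spark.local.dir", s"/x/home/$user/spark/tmp")
```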

Re: Having lots of FetchFailedException in join

2015-03-05 Thread Jianshi Huang
:01 PM, Shao, Saisai saisai.s...@intel.com wrote: I think there’s a lot of JIRA trying to solve this problem ( https://issues.apache.org/jira/browse/SPARK-5763). Basically sort merge join is a good choice. Thanks Jerry *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] *Sent

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
One really interesting thing is that when I'm using the netty-based spark.shuffle.blockTransferService, there are no OOM error messages (java.lang.OutOfMemoryError: Java heap space). Any idea why it's not here? I'm using Spark 1.2.1. Jianshi On Thu, Mar 5, 2015 at 1:56 PM, Jianshi Huang jianshi.hu
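For reference, the switch being compared above. spark.shuffle.blockTransferService is a Spark 1.2-era property with values netty (the default) and nio, and it was removed in later releases, so treat this as a sketch for that era only.

```scala
import org.apache.spark.SparkConf

// Compare shuffle fetch behaviour under the two 1.2-era transfer services.
val conf = new SparkConf()
  .set("spark.shuffle.blockTransferService", "nio")   // or "netty" (the 1.2.x default)
```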

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
, Jianshi Huang jianshi.hu...@gmail.com wrote: I see. I'm using core's join. The data might have some skewness (checking). I understand shuffle can spill data to disk but when consuming it, say in cogroup or groupByKey, it still needs to read the whole group elements, right? I guess OOM happened

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
is skewed or key number is smaller, so you will meet OOM. Maybe you could monitor each stage or task’s shuffle and GC status also system status to identify the problem. Thanks Jerry *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] *Sent:* Thursday, March 5, 2015 2:32 PM *To:* Aaron

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
the shuffle related operations can spill the data into disk and no need to read the whole partition into memory. But if you uses SparkSQL, it depends on how SparkSQL uses this operators. CC @hao if he has some thoughts on it. Thanks Jerry *From:* Jianshi Huang [mailto:jianshi.hu

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
, Jianshi Huang jianshi.hu...@gmail.com wrote: Hmm... ok, previous errors are still block fetch errors. 15/03/03 10:22:40 ERROR RetryingBlockFetcher: Exception while beginning fetch of 11 outstanding blocks java.io.IOException: Failed to connect to host-/:55597

[no subject]

2015-03-03 Thread Jianshi Huang
-SNAPSHOT I built around Dec. 20. Is there any bug fixes related to shuffle block fetching or index files after that? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Jianshi Huang
) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) Jianshi On Wed, Mar 4, 2015 at 2:55 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I got this error

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Jianshi Huang
Davidson ilike...@gmail.com wrote: Drat! That doesn't help. Could you scan from the top to see if there were any fatal errors preceding these? Sometimes a OOM will cause this type of issue further down. On Tue, Mar 3, 2015 at 8:16 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: The failed

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Jianshi Huang
check its logs as well. On Tue, Mar 3, 2015 at 11:03 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Sorry that I forgot the subject. And in the driver, I got many FetchFailedException. The error messages are 15/03/03 10:34:32 WARN TaskSetManager: Lost task 31.0 in stage 2.2 (TID 7943

Dynamic partition pattern support

2015-02-15 Thread Jianshi Huang
: https://issues.apache.org/jira/browse/SPARK-5828 Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-13 Thread Jianshi Huang
: I think we made the binary protocol compatible across all versions, so you should be fine with using any one of them. 1.2.1 is probably the best since it is the most recent stable release. On Tue, Feb 10, 2015 at 8:43 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I need to use

Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-10 Thread Jianshi Huang
, 1.3.0) Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Pig loader in Spark

2015-02-03 Thread Jianshi Huang
Hi, Anyone has implemented the default Pig Loader in Spark? (loading delimited text files with .pig_schema) Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Hive UDAF percentile_approx says This UDAF does not support the deprecated getEvaluator() method.

2015-01-13 Thread Jianshi Huang
) at org.apache.spark.sql.catalyst.plans.logical.Aggregate$$anonfun$output$6.apply(basicOperators.scala:143) I'm using latest branch-1.2 I found in PR that percentile and percentile_approx are supported. A bug? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Hive UDAF percentile_approx says This UDAF does not support the deprecated getEvaluator() method.

2015-01-13 Thread Jianshi Huang
SimpleGenericUDAFParameterInfo(inspectors.toArray, false, false) resolver.getEvaluator(parameterInfo) FYI On Tue, Jan 13, 2015 at 1:51 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, The following SQL query select percentile_approx(variables.var1, 0.95) p95 from model will throw ERROR

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-23 Thread Jianshi Huang
FYI, Latest hive 0.14/parquet will have column renaming support. Jianshi On Wed, Dec 10, 2014 at 3:37 AM, Michael Armbrust mich...@databricks.com wrote: You might also try out the recently added support for views. On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang jianshi.hu...@gmail.com wrote

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-08 Thread Jianshi Huang
, 2014 at 8:28 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Ok, found another possible bug in Hive. My current solution is to use ALTER TABLE CHANGE to rename the column names. The problem is after renaming the column names, the value of the columns became all NULL. Before renaming

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
Very interesting, the line doing drop table throws an exception. After removing it, all works. Jianshi On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Here's the solution I got after talking with Liancheng: 1) using backquote `..` to wrap up all illegal

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
Hmm... another issue I found doing this approach is that ANALYZE TABLE ... COMPUTE STATISTICS will fail to attach the metadata to the table, and later broadcast join and such will fail... Any idea how to fix this issue? Jianshi On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang jianshi.hu

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
sql("select cre_ts from pmt limit 1").collect res16: Array[org.apache.spark.sql.Row] = Array([null]) I created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-4781 Jianshi On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hmm... another issue I found

Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-06 Thread Jianshi Huang
Hmm.. I've created a JIRA: https://issues.apache.org/jira/browse/SPARK-4782 Jianshi On Sun, Dec 7, 2014 at 2:32 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, What's the best way to convert RDD[Map[String, Any]] to a SchemaRDD? I'm currently converting each Map to a JSON String

Re: drop table if exists throws exception

2014-12-05 Thread Jianshi Huang
fine for me on master. Note that Hive does print an exception in the logs, but that exception does not propagate to user code. On Thu, Dec 4, 2014 at 11:31 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I got an exception saying Hive: NoSuchObjectException(message:table table

Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-05 Thread Jianshi Huang
table pmt ( sorted::id bigint ) stored as parquet location '...' Obviously it didn't work, I also tried removing the identifier sorted::, but the resulting rows contain only nulls. Any idea how to create a table in HiveContext from these Parquet files? Thanks, Jianshi -- Jianshi Huang

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-05 Thread Jianshi Huang
(t.schema.fields.map(s => s.copy(name = s.name.replaceAll(".*?::", "")))) sql(s"drop table $name") applySchema(t, newSchema).registerTempTable(name) I'm testing it for now. Thanks for the help! Jianshi On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I had
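A cleaned-up sketch of the renaming workaround quoted above, written against the Spark 1.3+ DataFrame API (on 1.2 the equivalent call is sqlContext.applySchema). The table name and hiveContext are placeholders for whatever context and table you already have.

```scala
import org.apache.spark.sql.types.StructType

// Strip the Pig-generated "bag::" style prefixes from every column name,
// then re-register the data under the cleaned schema.
val t = hiveContext.sql("SELECT * FROM pmt")
val cleaned = StructType(t.schema.fields.map(f => f.copy(name = f.name.replaceAll(".*?::", ""))))
hiveContext.createDataFrame(t.rdd, cleaned).registerTempTable("pmt_clean")
```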

Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
using latest Spark built from master HEAD yesterday. Is this a bug? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
? Jianshi On Fri, Dec 5, 2014 at 11:37 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: I got the following error during Spark startup (Yarn-client mode): 14/12/04 19:33:58 INFO Client: Uploading resource file:/x/home/jianshuang/spark/spark-latest/lib/datanucleus-api-jdo-3.2.6.jar - hdfs

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Actually my HADOOP_CLASSPATH has already been set to include /etc/hadoop/conf/* export HADOOP_CLASSPATH=/etc/hbase/conf/hbase-site.xml:/usr/lib/hbase/lib/hbase-protocol.jar:$(hbase classpath) Jianshi On Fri, Dec 5, 2014 at 11:54 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Looks like

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Looks like the datanucleus*.jar shouldn't appear in the hdfs path in Yarn-client mode. Maybe this patch broke yarn-client. https://github.com/apache/spark/commit/a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53 Jianshi On Fri, Dec 5, 2014 at 12:02 PM, Jianshi Huang jianshi.hu...@gmail.com wrote

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Correction: According to Liancheng, this hotfix might be the root cause: https://github.com/apache/spark/commit/38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce Jianshi On Fri, Dec 5, 2014 at 12:45 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Looks like the datanucleus*.jar shouldn't appear

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
-most among the inner joins; DESC EXTENDED tablename; -- this will print the detail information for the statistic table size (the field “totalSize”) EXPLAIN EXTENDED query; -- this will print the detail physical plan. Let me know if you still have problem. Hao *From:* Jianshi Huang

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
, Jianshi Huang jianshi.hu...@gmail.com wrote: Sorry for the late of follow-up. I used Hao's DESC EXTENDED command and found some clue: new (broadcast broken Spark build): parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
With Liancheng's suggestion, I've tried setting spark.sql.hive.convertMetastoreParquet to false, but ANALYZE with NOSCAN still returns -1 in rawDataSize. Jianshi On Fri, Dec 5, 2014 at 3:33 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: If I run ANALYZE without NOSCAN, then Hive can successfully
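For completeness, a sketch of the statistics-plus-threshold recipe the thread is debugging, using properties documented for Spark SQL 1.x. Whether the statistics survive with convertMetastoreParquet disabled is exactly the open question above, so take this as the intended happy path rather than a fix; the table and column names are placeholders.

```scala
// Populate table size statistics so the planner can pick an automatic
// broadcast join for the small dimension table.
hiveContext.sql("ANALYZE TABLE small_dim COMPUTE STATISTICS noscan")

// Tables below this size (in bytes) are broadcast; the 1.x default is 10 MB.
hiveContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50L * 1024 * 1024).toString)

val joined = hiveContext.sql(
  "SELECT f.*, d.label FROM facts f JOIN small_dim d ON f.key = d.key")
```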

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-11-27 Thread Jianshi Huang
/3270 should be another optimization for this. *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] *Sent:* Wednesday, November 26, 2014 4:36 PM *To:* user *Subject:* Auto BroadcastJoin optimization failed in latest Spark Hi, I've confirmed that the latest Spark with either Hive

Auto BroadcastJoin optimization failed in latest Spark

2014-11-26 Thread Jianshi Huang
similar situation? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: How to do broadcast join in SparkSQL

2014-11-25 Thread Jianshi Huang
) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327) Using the same DDL and Analyze script above. Jianshi On Sat, Oct 11, 2014 at 2:18 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: It works fine, thanks for the help Michael. Liancheng

Re: How to do broadcast join in SparkSQL

2014-11-25 Thread Jianshi Huang
/usr/lib/hive/lib doesn’t show any of the parquet jars, but ls /usr/lib/impala/lib shows the jar we’re looking for as parquet-hive-1.0.jar Is it removed from latest Spark? Jianshi On Wed, Nov 26, 2014 at 2:13 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, Looks like the latest SparkSQL

How to deal with BigInt in my case class for RDD => SchemaRDD conversion

2014-11-21 Thread Jianshi Huang
Hi, I got an error during rdd.registerTempTable(...) saying scala.MatchError: scala.BigInt Looks like BigInt cannot be used in SchemaRDD, is that correct? So what would you recommend to deal with it? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http

Re: How to deal with BigInt in my case class for RDD => SchemaRDD conversion

2014-11-21 Thread Jianshi Huang
@gmail.com wrote: Hello Jianshi, The reason of that error is that we do not have a Spark SQL data type for Scala BigInt. You can use Decimal for your case. Thanks, Yin On Fri, Nov 21, 2014 at 5:11 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I got an error during
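A minimal sketch of the suggestion above: carry the value as a type Spark SQL 1.x can map. scala.math.BigDecimal goes through the case-class reflection path as DecimalType, or a Long can be used if the values fit. rawRdd and sqlContext are placeholders, and the 1.3+ createDataFrame call is shown; on 1.1/1.2 the implicit createSchemaRDD conversion plays the same role.

```scala
// scala.BigInt has no Spark SQL mapping, so convert before registering.
case class Record(id: Long, amount: BigDecimal)

val records = rawRdd.map { case (id: BigInt, amt: BigInt) =>
  Record(id.toLong, BigDecimal(amt))   // assumes the ids fit in a Long
}
sqlContext.createDataFrame(records).registerTempTable("records")
```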

Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-18 Thread Jianshi Huang
the build instructions here : https://github.com/ScrapCodes/spark-1/blob/patch-3/docs/building-spark.md Prashant Sharma On Tue, Nov 18, 2014 at 12:19 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Any notable issues for using Scala 2.11? Is it stable now? Or can I use Scala 2.11 in my

Re: Is there setup and cleanup function in spark?

2014-11-17 Thread Jianshi Huang
, Nov 14, 2014 at 2:49 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Ok, then we need another trick. let's have an *implicit lazy var connection/context* around our code. And setup() will trigger the eval and initialization. Due to lazy evaluation, I think having setup/teardown is a bit

Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Jianshi Huang
Any notable issues for using Scala 2.11? Is it stable now? Or can I use Scala 2.11 in my spark application and use Spark dist build with 2.10 ? I'm looking forward to migrate to 2.11 for some quasiquote features. Couldn't make it run in 2.10... Cheers, -- Jianshi Huang LinkedIn: jianshi

Compiling Spark master HEAD failed.

2014-11-14 Thread Jianshi Huang
: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found. - [Help 1] Anyone know what the problem is? I'm building it on OSX. I didn't have this problem a month ago. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http

Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Jianshi Huang
-mapreduce-to-apache-spark/ On 11/14/14 10:44 AM, Dai, Kevin wrote: HI, all Is there setup and cleanup function as in hadoop mapreduce in spark which does some initialization and cleanup work? Best Regards, Kevin. ​ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github

Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Jianshi Huang
@gmail.com wrote: If you’re just relying on the side effect of setup() and cleanup() then I think this trick is OK and pretty cleaner. But if setup() returns, say, a DB connection, then the map(...) part and cleanup() can’t get the connection object. On 11/14/14 1:20 PM, Jianshi Huang wrote: So
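The pattern being circled around here is usually written with mapPartitions, so the connection lives and dies inside the same closure the records flow through. A rough sketch; ConnectionPool and process are stand-ins for whatever resource and work you actually have.

```scala
// One connection per partition: "setup" before the iterator is consumed,
// "cleanup" after, and the mapped records can see the connection.
val written = rdd.mapPartitions { iter =>
  val conn = ConnectionPool.acquire()          // placeholder setup()
  try {
    // materialize inside the try so every record is processed
    // before the connection goes away
    iter.map(record => process(conn, record)).toVector.iterator
  } finally {
    ConnectionPool.release(conn)               // placeholder cleanup()
  }
}
```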

Re: RDD to DStream

2014-11-12 Thread Jianshi Huang
needs to be collected to the driver; is there a way to avoid doing this? Thanks Jianshi On Mon, Oct 27, 2014 at 4:57 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Sure, let's still focus on the streaming simulation use case. It's a very useful problem to solve. If we're going to use the same
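One way to sketch the streaming-simulation idea without writing a custom DStream is queueStream, feeding one pre-bucketed RDD per batch. This still assumes the time-ordered buckets have been prepared up front, which is the hard part the thread keeps coming back to; the buckets, sc, and Event type are placeholders.

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// buckets: Seq[(Long, RDD[Event])] keyed by window timestamp, prepared elsewhere.
val ssc = new StreamingContext(sc, Seconds(1))
val queue = new mutable.Queue[RDD[Event]]()
buckets.sortBy(_._1).foreach { case (_, rdd) => queue += rdd }

// One queued RDD is served per batch interval, preserving the original order.
val simulated = ssc.queueStream(queue, oneAtATime = true)
```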

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-11-10 Thread Jianshi Huang
-version suffixes in: org.scalamacros:quasiquotes On Thu, Oct 30, 2014 at 9:50 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Preshant, Chester, Mohammed, I switched to Spark's Akka and now it works well. Thanks for the help! (Need to exclude Akka from Spray dependencies, or specify
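A sketch of the dependency exclusion mentioned above, in sbt syntax. The Spray artifact and version are illustrative; the point is only that Spray must not drag its own Akka onto the classpath next to the one Spark ships.

```scala
// build.sbt -- keep Spark's (patched) Akka on the classpath and strip the
// transitive Akka that spray-client would otherwise bring in.
libraryDependencies += ("io.spray" %% "spray-client" % "1.3.3")
  .excludeAll(ExclusionRule(organization = "com.typesafe.akka"))
```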

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-30 Thread Jianshi Huang
that. Can you try a Spray version built with 2.2.x along with Spark 1.1 and include the Akka dependencies in your project’s sbt file? Mohammed *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] *Sent:* Tuesday, October 28, 2014 8:58 PM *To:* Mohammed Guller *Cc:* user *Subject:* Re

Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
has idea what went wrong? Need help! -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
<akka.version>2.3.4-spark</akka.version> it should solve the problem. Makes sense? I'll give it a shot when I have time; for now I'll probably just not use the Spray client... Cheers, Jianshi On Tue, Oct 28, 2014 at 6:02 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I got the following

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
I'm using Spark built from HEAD, I think it uses modified Akka 2.3.4, right? Jianshi On Wed, Oct 29, 2014 at 5:53 AM, Mohammed Guller moham...@glassbeam.com wrote: Try a version built with Akka 2.2.x Mohammed *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] *Sent:* Tuesday

Spark streaming update/restart gracefully

2014-10-27 Thread Jianshi Huang
? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
at it. For your case, I think TD’s comment are quite meaningful, it’s not trivial to do so, often requires a job to scan all the records, it’s also not the design purpose of Spark Streaming, I guess it’s hard to achieve what you want. Thanks Jerry *From:* Jianshi Huang [mailto:jianshi.hu

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
to arrange data, but you cannot avoid scanning the whole data. Basically we need to avoid fetching large amount of data back to driver. Thanks Jerry *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] *Sent:* Monday, October 27, 2014 2:39 PM *To:* Shao, Saisai *Cc:* user

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
PM, Jianshi Huang jianshi.hu...@gmail.com wrote: You're absolutely right, it's not 'scalable' as I'm using collect(). However, it's important to have the RDDs ordered by the timestamp of the time window (groupBy puts data to corresponding timewindow). It's fairly easy to do in Pig, but somehow

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
nested RDD in closure. Thanks Jerry *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] *Sent:* Monday, October 27, 2014 3:30 PM *To:* Shao, Saisai *Cc:* user@spark.apache.org; Tathagata Das (t...@databricks.com) *Subject:* Re: RDD to DStream Ok, back to Scala code, I'm wondering

Ephemeral Hive metastore for HiveContext?

2014-10-27 Thread Jianshi Huang
HiveContext is a subclass, we should make the same semantics as default. Make sense? Spark is very much functional and shared nothing, these are wonderful features. Let's not have something global as a dependency. Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
On Mon, Oct 27, 2014 at 4:44 PM, Shao, Saisai saisai.s...@intel.com wrote: Yes, I understand what you want, but maybe hard to achieve without collecting back to driver node. Besides, can we just think of another way to do it. Thanks Jerry *From:* Jianshi Huang

Re: Which is better? One spark app listening to 10 topics vs. 10 spark apps each listening to 1 topic

2014-10-27 Thread Jianshi Huang
Any suggestion? :) Jianshi On Thu, Oct 23, 2014 at 3:49 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: The Kafka stream has 10 topics and the data rate is quite high (~ 100K/s per topic). Which configuration do you recommend? - 1 Spark app consuming all Kafka topics - 10 separate Spark

Re: Ephemeral Hive metastore for HiveContext?

2014-10-27 Thread Jianshi Huang
can use an in-memory Derby database as metastore https://db.apache.org/derby/docs/10.7/devguide/cdevdvlpinmemdb.html I'll investigate this when free, guess we can use this for Spark SQL Hive support testing. On 10/27/14 4:38 PM, Jianshi Huang wrote: There's an annoying small usability
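The in-memory Derby idea from the reply, sketched as the relevant JDO property. It has to take effect before the metastore is first touched (hive-site.xml is the safer place), and the property name and URL come from the Hive/DataNucleus side rather than Spark itself; hiveContext is a placeholder.

```scala
// Point the embedded metastore at an in-memory Derby database so nothing is
// persisted between HiveContext sessions (assumption: set early enough that
// the metastore has not been initialised yet).
hiveContext.setConf("javax.jdo.option.ConnectionURL",
  "jdbc:derby:memory:ephemeral_metastore;create=true")
```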

Re: RDD to DStream

2014-10-26 Thread Jianshi Huang
) } } } -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Dynamically loaded Spark-stream consumer

2014-10-23 Thread Jianshi Huang
program from time to time. Is there a mechanism by which Spark Streaming can load and plug in code at runtime without restarting? Any solutions or suggestions? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Which is better? One spark app listening to 10 topics vs. 10 spark apps each listening to 1 topic

2014-10-23 Thread Jianshi Huang
The Kafka stream has 10 topics and the data rate is quite high (~ 100K/s per topic). Which configuration do you recommend? - 1 Spark app consuming all Kafka topics - 10 separate Spark app each consuming one topic Assuming they have the same resource pool. Cheers, -- Jianshi Huang LinkedIn
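For the single-app option, a sketch of fanning the ten topics into one application with a receiver per topic and a union. The ZooKeeper quorum, consumer group, and receiver thread counts are placeholders, and at roughly 100K msgs/s per topic the receiver parallelism is the thing to tune either way; ssc is an existing StreamingContext.

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

// One receiver-based input stream per topic, unioned into a single DStream
// for downstream processing (Spark 1.x receiver API).
val topics = (1 to 10).map(i => s"topic$i")
val streams = topics.map { t =>
  KafkaUtils.createStream(ssc, "zk1:2181,zk2:2181", "my-consumer-group", Map(t -> 2))
}
val all = ssc.union(streams)
```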

Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Jianshi Huang
-- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
dim tables (using HiveContext) and then map it to my class object. It failed a couple of times and now I cached the intermediate table and currently it seems working fine... no idea why until I found SPARK-3106 Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
Hmm... it failed again, just lasted a little bit longer. Jianshi On Mon, Oct 13, 2014 at 4:15 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: https://issues.apache.org/jira/browse/SPARK-3106 I'm having the saming errors described in SPARK-3106 (no other types of errors confirmed), running

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
Turned out it was caused by this issue: https://issues.apache.org/jira/browse/SPARK-3923 Set spark.akka.heartbeat.interval to 100 solved it. Jianshi On Mon, Oct 13, 2014 at 4:24 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hmm... it failed again, just lasted a little bit longer. Jianshi
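The workaround from the message above, as it would be passed to an application. spark.akka.heartbeat.interval is a Spark 1.x Akka failure-detector knob, and the value 100 is simply what the thread reports working, so verify the units against your version's documentation.

```scala
import org.apache.spark.SparkConf

// Tune the Akka failure detector, per the SPARK-3923 discussion in this thread.
val conf = new SparkConf()
  .set("spark.akka.heartbeat.interval", "100")
```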

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
On Tue, Oct 14, 2014 at 4:36 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Turned out it was caused by this issue: https://issues.apache.org/jira/browse/SPARK-3923 Set spark.akka.heartbeat.interval to 100 solved it. Jianshi On Mon, Oct 13, 2014 at 4:24 PM, Jianshi Huang jianshi.hu
