Re: How to do broadcast join in SparkSQL

2014-10-11 Thread Jianshi Huang
sql(ddl) setConf("spark.sql.hive.convertMetastoreParquet", "true") } You'll also need to run this to populate the statistics: ANALYZE TABLE tableName COMPUTE STATISTICS noscan; On Wed, Oct 8, 2014 at 1:44 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Ok, currently there's cost-based
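
For reference, the recipe above boils down to something like this (a minimal sketch against the Spark 1.1-era HiveContext API; the table name "dim" is a hypothetical stand-in):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
    // Populate size statistics so the planner can consider a broadcast join
    // for tables under spark.sql.autoBroadcastJoinThreshold:
    hiveContext.sql("ANALYZE TABLE dim COMPUTE STATISTICS noscan")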

Re: How to do broadcast join in SparkSQL

2014-10-08 Thread Jianshi Huang
, 2014 at 1:24 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Yes, looks like it can only be controlled by the parameter spark.sql.autoBroadcastJoinThreshold, which is a little bit weird to me. How am I supposed to know the exact bytes of a table? Letting me specify the preferred join algorithm I
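
A minimal sketch of tuning that parameter (the 100 MB figure is purely illustrative):

    // Tables whose recorded statistics fall below this many bytes are broadcast:
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)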

Re: How to do broadcast join in SparkSQL

2014-10-08 Thread Jianshi Huang
, Jianshi Huang jianshi.hu...@gmail.com wrote: Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not merged into master? I cannot find spark.sql.hints.broadcastTables in latest master, but it's in the following patch. https://github.com/apache/spark/commit

How to do broadcast join in SparkSQL

2014-09-28 Thread Jianshi Huang
I cannot find it in the documentation. And I have a dozen dimension tables to (left) join... Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: How to do broadcast join in SparkSQL

2014-09-28 Thread Jianshi Huang
yuzhih...@gmail.com wrote: Have you looked at SPARK-1800 ? e.g. see sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala Cheers On Sun, Sep 28, 2014 at 1:55 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: I cannot find it in the documentation. And I have a dozen dimension tables

Re:

2014-09-25 Thread Jianshi Huang
is in hadoop 2.6.0 Any chance of deploying 2.6.0-SNAPSHOT to see if the problem goes away? On Wed, Sep 24, 2014 at 10:54 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Looks like it's an HDFS issue, pretty new. https://issues.apache.org/jira/browse/HDFS-6999 Jianshi On Thu, Sep 25

[no subject]

2014-09-24 Thread Jianshi Huang
) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http

Executor/Worker stuck at parquet.hadoop.ParquetFileReader.readNextRowGroup and never finishes.

2014-09-24 Thread Jianshi Huang
) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) -- Jianshi

Re:

2014-09-24 Thread Jianshi Huang
. at com.paypal.risk.rds.dragon.storage.hbase.HbaseRDDBatch$$anonfun$batchInsertEdges$3.apply(HbaseRDDBatch.scala:179) Can you reveal what HbaseRDDBatch.scala does? Cheers On Wed, Sep 24, 2014 at 8:46 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: One of my big Spark programs always gets stuck at 99%, where a few tasks never

Re:

2014-09-24 Thread Jianshi Huang
in the dark: have you checked the region server logs to see if the region servers had trouble keeping up? Cheers On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Ted, It converts RDD[Edge] to HBase rowkeys and columns and inserts them into HBase (in batches). BTW, I found

Re:

2014-09-24 Thread Jianshi Huang
regionserver needs to be balanced...you might have some skewness in row keys and one regionserver is under pressure...try finding that key and replicating it using a random salt On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Ted, It converts RDD[Edge
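
A rough sketch of the salting idea, assuming an RDD of string row keys ("rowKeyed" and the bucket count 16 are hypothetical):

    // Prefix each row key with a hash-derived salt so writes spread across
    // region servers instead of hammering a single hot region:
    val salted = rowKeyed.map { case (key, value) =>
      val salt = (key.hashCode & Int.MaxValue) % 16
      (s"$salt-$key", value)
    }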

Re:

2014-09-24 Thread Jianshi Huang
finding that key and replicating it using a random salt On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Ted, It converts RDD[Edge] to HBase rowkeys and columns and inserts them into HBase (in batches). BTW, I found batched Puts actually faster than generating HFiles
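
For context, a minimal sketch of the batched-Put approach mentioned here (HBase 0.94/0.98-era client API; the table name "edges", column family "cf", and the string-pair RDD are hypothetical):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes

    edges.foreachPartition { part =>
      val table = new HTable(HBaseConfiguration.create(), "edges")
      table.setAutoFlush(false)  // buffer Puts client-side
      part.foreach { case (rowKey: String, value: String) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(value))
        table.put(put)
      }
      table.flushCommits()  // send the buffered batch
      table.close()
    }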

Re:

2014-09-24 Thread Jianshi Huang
Looks like it's an HDFS issue, pretty new. https://issues.apache.org/jira/browse/HDFS-6999 Jianshi On Thu, Sep 25, 2014 at 12:10 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Ted, See my previous reply to Debasish, all region servers are idle. I don't think it's caused by hotspotting

Re: How to list all registered tables in a sql context?

2014-09-08 Thread Jianshi Huang
Thanks Tobias, I also found this: https://issues.apache.org/jira/browse/SPARK-3299 Looks like it's being worked on. Jianshi On Mon, Sep 8, 2014 at 9:28 AM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Sat, Sep 6, 2014 at 1:40 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Err
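
Until SPARK-3299 lands, one workaround (assuming a HiveContext backed by a metastore; temp tables created via registerTempTable won't show up):

    // SHOW TABLES is executed by Hive, so it only lists metastore tables:
    hiveContext.sql("SHOW TABLES").collect().foreach(println)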

Re: How to list all registered tables in a sql context?

2014-09-05 Thread Jianshi Huang
Err... there's no such feature? Jianshi On Wed, Sep 3, 2014 at 7:03 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, How can I list all registered tables in a sql context? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com

Re: .sparkrc for Spark shell?

2014-09-04 Thread Jianshi Huang
I see. Thanks Prashant! Jianshi On Wed, Sep 3, 2014 at 7:05 PM, Prashant Sharma scrapco...@gmail.com wrote: Hey, you can use spark-shell -i sparkrc to do this. Prashant Sharma On Wed, Sep 3, 2014 at 2:17 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: To make my shell experience
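
A sketch of what such a startup file might contain (contents are illustrative), loaded with spark-shell -i ~/.sparkrc:

    // ~/.sparkrc -- executed line by line in the shell, where sc already exists
    import org.apache.spark.SparkContext._
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext._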

.sparkrc for Spark shell?

2014-09-03 Thread Jianshi Huang
To make my shell experience merrier, I need to import several packages, and define implicit sparkContext and sqlContext. Is there a startup file (e.g. ~/.sparkrc) that Spark shell will load when it's started? Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http

How to list all registered tables in a sql context?

2014-09-03 Thread Jianshi Huang
Hi, How can I list all registered tables in a sql context? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Out of memory on large RDDs

2014-08-27 Thread Jianshi Huang
-on-large-RDDs-tp2533p2537.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Spark SQL (version 1.1.0-SNAPSHOT) should allow SELECT with duplicated columns

2014-08-06 Thread Jianshi Huang
in my select clause. I made the duplication on purpose for my code to parse correctly. I think we should allow users to specify duplicated columns as return values. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
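
For illustration, the kind of query at issue (table and column names are hypothetical):

    // Selecting the same column twice on purpose; Spark SQL 1.1.0-SNAPSHOT
    // rejected this even though most SQL engines accept it:
    sqlContext.sql("SELECT key, key, value FROM events")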

Re: spark.shuffle.consolidateFiles seems not working

2014-07-31 Thread Jianshi Huang
shuffle file numbers, but the number of concurrently open files is the same as with basic hash-based shuffle. Thanks Jerry From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Thursday, July 31, 2014 10:34 AM To: user@spark.apache.org Cc: xia...@sjtu.edu.cn Subject: Re

Re: spark.shuffle.consolidateFiles seems not working

2014-07-30 Thread Jianshi Huang
-n to set the open-files limit (and other limits too). And I set -n to 10240. I see spark.shuffle.consolidateFiles helps by reusing open files (though I don't know to what extent it helps). Hope it helps. Larry On 7/30/14, 4:01 PM, Jianshi Huang wrote: I'm using Spark 1.0.1 on Yarn
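
A sketch of the relevant knob (the OS-level limit has to be raised separately, e.g. ulimit -n 10240, before the JVM starts):

    // Reuse shuffle output files across map tasks running on the same core,
    // cutting the file count from roughly maps * reducers to cores * reducers:
    val conf = new org.apache.spark.SparkConf()
      .set("spark.shuffle.consolidateFiles", "true")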

Re: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-28 Thread Jianshi Huang
? This would be helpful. I personally like Yarn-Client mode since the running status can all be checked directly from the console. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-27 Thread Jianshi Huang
, Jul 25, 2014 at 12:24 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: I can successfully run my code in local mode using spark-submit (--master local[4]), but I got ExceptionInInitializerError errors in Yarn-client mode. Any hints as to what the problem is? Is it a closure serialization problem

Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-24 Thread Jianshi Huang
$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Use Spark with HBase' HFileOutputFormat

2014-07-16 Thread Jianshi Huang
or PutSortReducer) But in Spark, it seems I have to do the sorting and partitioning myself, right? Can anyone show me how to do it properly? Is there a better way to ingest data quickly into HBase from Spark? Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Wildcard support in input path

2014-06-18 Thread Jianshi Huang
was like this: b = sc.textFile("hdfs:///path to file/data_file_2013SEP01*") Thanks Regards, Meethu M On Wednesday, 18 June 2014 9:29 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: It would be convenient if Spark's textFile, parquetFile, etc. could support paths with wildcards

Re: Wildcard support in input path

2014-06-18 Thread Jianshi Huang
Hi all, Thanks for the reply. I'm using parquetFile as input, is that a problem? In hadoop fs -ls, the path (hdfs://domain/user/jianshuang/data/parquet/table/month=2014*) will list all the files. I'll test it again. Jianshi On Wed, Jun 18, 2014 at 2:23 PM, Jianshi Huang jianshi.hu

Re: Wildcard support in input path

2014-06-18 Thread Jianshi Huang
of their name? On Wed, Jun 18, 2014 at 2:25 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi all, Thanks for the reply. I'm using parquetFile as input, is that a problem? In hadoop fs -ls, the path (hdfs://domain/user/jianshuang/data/parquet/table/month=2014*) will list all the files

Yarn-client mode and standalone-client mode hang during job start

2014-06-17 Thread Jianshi Huang
-time-2.3.jar kryo-2.21.jar libthrift.jar quasiquotes_2.10-2.0.0-M8.jar scala-async_2.10-0.9.1.jar scala-library-2.10.4.jar scala-reflect-2.10.4.jar Anyone has hint what went wrong? Really confused. Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http

Re: Yarn-client mode and standalone-client mode hang during job start

2014-06-17 Thread Jianshi Huang
at same address: akka.tcp://sparkwor...@lvshdc5dn0321.lvs.paypal.com:41987 Is that a bug? Jianshi On Tue, Jun 17, 2014 at 5:41 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I'm stuck using either yarn-client or standalone-client mode. Either one gets stuck when I submit jobs; the last

Re: Yarn-client mode and standalone-client mode hang during job start

2014-06-17 Thread Jianshi Huang
of it? If the latter, could you try running it from within the cluster and see if it works? (Does your rtgraph.jar exist on the machine from which you run spark-submit?) 2014-06-17 2:41 GMT-07:00 Jianshi Huang jianshi.hu...@gmail.com: Hi, I'm stuck using either yarn-client or standalone-client mode

Wildcard support in input path

2014-06-17 Thread Jianshi Huang
It would be convenient if Spark's textFile, parquetFile, etc. could support paths with wildcards, such as: hdfs://domain/user/jianshuang/data/parquet/table/month=2014* Or is there already a way to do it now? Jianshi -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http
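
For what it's worth, textFile goes through Hadoop's FileInputFormat, which already expands globs; a quick check (the path is a hypothetical variant of the one above):

    // Glob expansion is handled by the underlying Hadoop FileSystem:
    val rdd = sc.textFile("hdfs://domain/user/jianshuang/data/parquet/table/month=2014*")
    println(rdd.partitions.size)  // non-zero if the glob matched any files

Whether parquetFile expands globs the same way is exactly what this thread goes on to test.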

Need help. Spark + Accumulo = Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-16 Thread Jianshi Huang
/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA I googled it and it looks like I need to exclude some JARs. Has anyone done that? Your help is really appreciated. Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http
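
A hedged sbt sketch of the exclusion approach (coordinates and versions are illustrative, not verified against this exact setup):

    // build.sbt -- drop the transitive commons-codec that shadows 1.7...
    libraryDependencies += "org.apache.accumulo" % "accumulo-core" % "1.5.1" exclude("commons-codec", "commons-codec")

    // ...and pin the version that provides encodeBase64String:
    dependencyOverrides += "commons-codec" % "commons-codec" % "1.7"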

Re: Need help. Spark + Accumulo = Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-16 Thread Jianshi Huang
...@sigmoidanalytics.com wrote: Hi, check your driver program's Environment page (e.g. http://192.168.1.39:4040/environment/). If you don't see the commons-codec-1.7.jar there, then that's the issue. Thanks Best Regards On Mon, Jun 16, 2014 at 5:07 PM, Jianshi Huang jianshi.hu...@gmail.com wrote
