sql(ddl)
setConf("spark.sql.hive.convertMetastoreParquet", "true")
}
You'll also need to run this to populate the statistics:
ANALYZE TABLE tableName COMPUTE STATISTICS noscan;
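For readers piecing this together later, a minimal sketch of both steps in a single spark-shell session (Spark 1.1-era HiveContext API; the table name dimTable is hypothetical):

```scala
import org.apache.spark.sql.hive.HiveContext

// sc is the SparkContext that spark-shell provides
val hiveContext = new HiveContext(sc)
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
// Populate size statistics so the planner can decide on broadcast joins
hiveContext.sql("ANALYZE TABLE dimTable COMPUTE STATISTICS noscan")
```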
On Wed, Oct 8, 2014 at 1:44 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Ok, currently there's cost-based
, 2014 at 1:24 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Yes, looks like it can only be controlled by the
parameter spark.sql.autoBroadcastJoinThreshold, which is a little bit weird
to me.
How am I supposed to know the exact size of a table in bytes? Letting me
specify the preferred join algorithm would be better; I
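As a hedged sketch of the workaround being discussed: the threshold is a byte count compared against the table statistics, so in practice one sets it generously rather than measuring exact table sizes (the 50 MB figure below is illustrative):

```scala
// Tables whose estimated size falls below this many bytes are broadcast
// to all executors instead of being shuffled for the join.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)
```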
, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not merged
into master?
I cannot find spark.sql.hints.broadcastTables in latest master, but it's
in the following patch.
https://github.com/apache/spark/commit
I cannot find it in the documentation. And I have a dozen dimension tables
to (left) join...
Cheers,
--
Jianshi Huang
LinkedIn: jianshi
Twitter: @jshuang
Github Blog: http://huangjs.github.com/
yuzhih...@gmail.com wrote:
Have you looked at SPARK-1800 ?
e.g. see sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
Cheers
On Sun, Sep 28, 2014 at 1:55 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
I cannot find it in the documentation. And I have a dozen dimension
tables
is in hadoop 2.6.0
Any chance of deploying 2.6.0-SNAPSHOT to see if the problem goes away ?
On Wed, Sep 24, 2014 at 10:54 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Looks like it's an HDFS issue, pretty new.
https://issues.apache.org/jira/browse/HDFS-6999
Jianshi
On Thu, Sep 25
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
--
Jianshi
at com.paypal.risk.rds.dragon.storage.hbase.HbaseRDDBatch$$anonfun$batchInsertEdges$3.apply(HbaseRDDBatch.scala:179)
Can you reveal what HbaseRDDBatch.scala does ?
Cheers
On Wed, Sep 24, 2014 at 8:46 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
One of my big Spark programs always gets stuck at 99%, where a few tasks
never
in the dark: have you checked region server (logs) to see if
region server had trouble keeping up ?
Cheers
On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi Ted,
It converts RDD[Edge] to HBase rowkey and columns and insert them to
HBase (in batch).
BTW, I found
The regionserver load needs to be balanced. You might have some skewness in
row keys and one regionserver is under pressure. Try finding that key and
replicating it using a random salt.
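A minimal, self-contained illustration of the salting idea (the function name and the two-digit prefix format are my own; any stable hash works):

```scala
// Prefix each row key with a salt bucket so a hot key range is spread
// across several regions instead of hammering one regionserver.
def saltKey(key: String, buckets: Int): String = {
  val salt = math.abs(key.hashCode % buckets) // stable bucket in [0, buckets)
  f"$salt%02d-$key"                           // e.g. a "07-" style prefix
}
```

Readers must then query all bucket prefixes to reassemble a key's rows, which is the usual cost of salting.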
BTW, I found batched Put actually faster than generating HFiles
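A hedged sketch of what batched Puts from Spark can look like (HBase 0.98-era client API assumed; the edges RDD, its fields, and the table/column names are hypothetical):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

edges.foreachPartition { iter =>
  // One HTable (and connection) per partition, not per record
  val table = new HTable(HBaseConfiguration.create(), "edges")
  iter.grouped(1000).foreach { batch =>   // batch size is a tuning knob
    val puts = batch.map { e =>
      val put = new Put(Bytes.toBytes(e.rowKey))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(e.value))
      put
    }
    table.put(puts.asJava)                // single round trip per batch
  }
  table.close()
}
```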
On Thu, Sep 25, 2014 at 12:10 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi Ted,
See my previous reply to Debasish, all region servers are idle. I don't
think it's caused by hotspotting
Thanks Tobias,
I also found this: https://issues.apache.org/jira/browse/SPARK-3299
Looks like it's being worked on.
Jianshi
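For anyone reading this later: once SPARK-3299 was resolved (in Spark 1.3), SQLContext gained methods for exactly this:

```scala
// Both available on SQLContext from Spark 1.3 onwards
val names  = sqlContext.tableNames()  // Array[String] of registered table names
val tables = sqlContext.tables()      // DataFrame with tableName / isTemporary columns
```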
On Mon, Sep 8, 2014 at 9:28 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Sat, Sep 6, 2014 at 1:40 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Err... there's no such feature?
Jianshi
On Wed, Sep 3, 2014 at 7:03 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
How can I list all registered tables in a sql context?
I see. Thanks Prashant!
Jianshi
On Wed, Sep 3, 2014 at 7:05 PM, Prashant Sharma scrapco...@gmail.com
wrote:
Hey,
You can use spark-shell -i sparkrc to do this.
Prashant Sharma
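A minimal sketch of what such an init script might contain (the file name and contents are illustrative; it is just Scala that spark-shell evaluates at startup):

```scala
// sparkrc -- loaded with: spark-shell -i sparkrc
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)  // sc is provided by spark-shell
import sqlContext._                   // bring SQL implicits into scope
```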
On Wed, Sep 3, 2014 at 2:17 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
To make my shell experience merrier, I need to import several packages, and
define implicit sparkContext and sqlContext.
Is there a startup file (e.g. ~/.sparkrc) that Spark shell will load when
it's started?
Cheers,
in my
select clause.
I made the duplication on purpose so my code parses correctly. I think
we should allow users to specify duplicated columns as return values.
shuffle file numbers, but the
number of concurrently open files is the same as with basic hash-based shuffle.
Thanks
Jerry
*From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
*Sent:* Thursday, July 31, 2014 10:34 AM
*To:* user@spark.apache.org
*Cc:* xia...@sjtu.edu.cn
*Subject:* Re
-n to set open files limit
(and other limits also)
And I set -n to 10240.
I see spark.shuffle.consolidateFiles helps by reusing open files.
(so I don't know to what extent it helps)
Hope it helps.
Larry
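To restate Larry's two knobs as a sketch (the app name below is illustrative; ulimit -n itself is set in the shell or limits.conf, outside Spark):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse shuffle output files across map tasks so the number of
// concurrently open files stops growing with mappers x reducers.
val conf = new SparkConf()
  .setAppName("shuffle-tuning-sketch")
  .set("spark.shuffle.consolidateFiles", "true")
val sc = new SparkContext(conf)
```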
On 7/30/14, 4:01 PM, Jianshi Huang wrote:
I'm using Spark 1.0.1 on Yarn
?
This would be helpful. I personally like yarn-client mode since all the
running status can be checked directly from the console.
, Jul 25, 2014 at 12:24 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
I can successfully run my code in local mode using spark-submit (--master
local[4]), but I got ExceptionInInitializerError errors in yarn-client mode.
Any hints as to what the problem is? Is it a closure serialization problem
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
or PutSortReducer)
But in Spark, it seems I have to do the sorting and partitioning myself, right?
Can anyone show me how to do it properly? Is there a better way to ingest
data fast to HBase from Spark?
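A heavily hedged sketch of the sorting/partitioning step (Spark 1.2+ API; RegionPartitioner, regionSplits, kvPairs, outputPath, and hbaseConf are hypothetical stand-ins, and HFileOutputFormat2 is the HBase 0.98+ output format):

```scala
// HFiles must be written in row-key order, one file set per region, so
// partition by region boundary and sort within each partition in one shuffle.
val sorted = kvPairs  // RDD[(ImmutableBytesWritable, KeyValue)]
  .repartitionAndSortWithinPartitions(new RegionPartitioner(regionSplits))
sorted.saveAsNewAPIHadoopFile(outputPath,
  classOf[ImmutableBytesWritable], classOf[KeyValue],
  classOf[HFileOutputFormat2], hbaseConf)
```

This plays the role that TotalOrderPartitioner and KeyValueSortReducer play in the MapReduce bulk-load path.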
Cheers,
was like this
b = sc.textFile("hdfs:///path to file/data_file_2013SEP01*")
Thanks Regards,
Meethu M
On Wednesday, 18 June 2014 9:29 AM, Jianshi Huang
jianshi.hu...@gmail.com wrote:
It would be convenient if Spark's textFile, parquetFile, etc. could
support paths with wildcards
Hi all,
Thanks for the reply. I'm using parquetFile as input, is that a problem? In
hadoop fs -ls, the path
(hdfs://domain/user/jianshuang/data/parquet/table/month=2014*)
will list all the files.
I'll test it again.
Jianshi
On Wed, Jun 18, 2014 at 2:23 PM, Jianshi Huang jianshi.hu
of their name?
On Wed, Jun 18, 2014 at 2:25 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi all,
Thanks for the reply. I'm using parquetFile as input, is that a problem?
In hadoop fs -ls, the path (hdfs://domain/user/
jianshuang/data/parquet/table/month=2014*) will list all the files
-time-2.3.jar
kryo-2.21.jar
libthrift.jar
quasiquotes_2.10-2.0.0-M8.jar
scala-async_2.10-0.9.1.jar
scala-library-2.10.4.jar
scala-reflect-2.10.4.jar
Does anyone have a hint about what went wrong? Really confused.
Cheers,
at same address: akka.tcp://sparkwor...@lvshdc5dn0321.lvs.paypal.com:41987
Is that a bug?
Jianshi
On Tue, Jun 17, 2014 at 5:41 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
I'm stuck using either yarn-client or standalone-client mode. Either will
get stuck when I submit jobs; the last
of it? If the latter, could you try running it from within the
cluster and see if it works? (Does your rtgraph.jar exist on the machine
from which you run spark-submit?)
2014-06-17 2:41 GMT-07:00 Jianshi Huang jianshi.hu...@gmail.com:
Hi,
I'm stuck using either yarn-client or standalone-client mode
It would be convenient if Spark's textFile, parquetFile, etc. could support
paths with wildcards, such as:
hdfs://domain/user/jianshuang/data/parquet/table/month=2014*
Or is there already a way to do it now?
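For context, Spark's file-based inputs go through Hadoop's FileInputFormat, which already expands glob patterns, so a sketch like this is expected to work as-is:

```scala
// Hadoop-style globs (*, ?, {a,b}, [0-9]) are expanded when listing input paths
val rdd = sc.textFile("hdfs://domain/user/jianshuang/data/parquet/table/month=2014*")
```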
Jianshi
/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA
I googled it and it looks like I need to exclude some JARs. Has anyone done
that? Your help is really appreciated.
Cheers,
...@sigmoidanalytics.com
wrote:
Hi
Check your driver program's Environment page (e.g.
http://192.168.1.39:4040/environment/). If you don't see the
commons-codec-1.7.jar there, then that's the issue.
Thanks
Best Regards
On Mon, Jun 16, 2014 at 5:07 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote