/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA
I googled it and it looks like I need to exclude some JARs. Has anyone done
that? Your help is really appreciated.
Cheers,
--
Jianshi Huang
LinkedIn: jianshi
Twitter: @jshuang
Github Blog: http://huangjs.github.com/
...@sigmoidanalytics.com
wrote:
Hi,
Check your driver program's Environment page (e.g.
http://192.168.1.39:4040/environment/). If you don't see
commons-codec-1.7.jar there, then that's the issue.
Thanks
Best Regards
On Mon, Jun 16, 2014 at 5:07 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
-time-2.3.jar
kryo-2.21.jar
libthrift.jar
quasiquotes_2.10-2.0.0-M8.jar
scala-async_2.10-0.9.1.jar
scala-library-2.10.4.jar
scala-reflect-2.10.4.jar
Does anyone have a hint about what went wrong? Really confused.
Cheers,
--
Jianshi Huang
at same address: akka.tcp://
sparkwor...@lvshdc5dn0321.lvs.paypal.com:41987
Is that a bug?
Jianshi
On Tue, Jun 17, 2014 at 5:41 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
I'm stuck using either yarn-client or standalone-client mode. Either will
get stuck when I submit jobs, the last
of it? If the latter, could you try running it from within the
cluster and see if it works? (Does your rtgraph.jar exist on the machine
from which you run spark-submit?)
2014-06-17 2:41 GMT-07:00 Jianshi Huang jianshi.hu...@gmail.com:
Hi,
I'm stuck using either yarn-client or standalone-client mode
It would be convenient if Spark's textFile, parquetFile, etc. could support
paths with wildcards, such as:
hdfs://domain/user/jianshuang/data/parquet/table/month=2014*
Or is there already a way to do it now?
Jianshi
--
Jianshi Huang
was like this
val b = sc.textFile("hdfs:///path to file/data_file_2013SEP01*")
Thanks Regards,
Meethu M
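For reference, a minimal sketch of the Hadoop glob syntax that sc.textFile
accepts; the paths here are hypothetical:

    // Wildcards, character classes and {a,b} alternation all work,
    // since textFile resolves paths through Hadoop's glob support.
    val oneMonth = sc.textFile("hdfs://domain/user/jianshuang/data/table/month=201406/*")
    val q4 = sc.textFile("hdfs://domain/user/jianshuang/data/table/month=2014{10,11,12}/*")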
On Wednesday, 18 June 2014 9:29 AM, Jianshi Huang
jianshi.hu...@gmail.com wrote:
It would be convenient if Spark's textFile, parquetFile, etc. could
support paths with wildcards
Hi all,
Thanks for the reply. I'm using parquetFile as input, is that a problem? In
hadoop fs -ls, the path
(hdfs://domain/user/jianshuang/data/parquet/table/month=2014*)
will list all the files.
I'll test it again.
Jianshi
On Wed, Jun 18, 2014 at 2:23 PM, Jianshi Huang jianshi.hu
of their name?
On Wed, Jun 18, 2014 at 2:25 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi all,
Thanks for the reply. I'm using parquetFile as input, is that a problem?
In hadoop fs -ls, the path
(hdfs://domain/user/jianshuang/data/parquet/table/month=2014*) will list all the files
or PutSortReducer)
But in Spark, it seems I have to do the sorting and partitioning myself, right?
Can anyone show me how to do it properly? Is there a better way to ingest
data fast to HBase from Spark?
Cheers,
--
Jianshi Huang
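As a reference for the batched-Put approach discussed later in this thread,
here is a minimal sketch, assuming an RDD of (rowkey, value) string pairs and
the HBase 0.96-era client API; the table and column names are made up:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HConnectionManager, Put}
    import org.apache.hadoop.hbase.util.Bytes
    import scala.collection.JavaConverters._

    rdd.foreachPartition { rows =>
      // One connection per partition; one Put per row, sent as a single batch.
      val conn = HConnectionManager.createConnection(HBaseConfiguration.create())
      val table = conn.getTable("edges")
      try {
        val puts = rows.map { case (rowkey, value) =>
          val p = new Put(Bytes.toBytes(rowkey))
          p.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
          p
        }.toList
        table.put(puts.asJava)   // one batched RPC per partition
      } finally {
        table.close()
        conn.close()
      }
    }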
$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
--
Jianshi Huang
, Jul 25, 2014 at 12:24 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
I can successfully run my code in local mode using spark-submit (--master
local[4]), but I got ExceptionInInitializerError errors in Yarn-client mode.
Any hints on what the problem is? Is it a closure serialization problem?
This would be helpful. I personally like Yarn-client mode, as all the
running status can be checked directly from the console.
--
Jianshi Huang
-n to set open files limit
(and other limits also)
And I set -n to 10240.
I see spark.shuffle.consolidateFiles helps by reusing open files.
(so I don't know to what extent it helps)
Hope it helps.
Larry
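A sketch of the two knobs mentioned above; the limit value is just an example,
and the OS limit itself is raised outside Spark (e.g. ulimit -n 10240 before
starting the workers):

    // Spark 1.x hash-based shuffle: reuse shuffle files instead of
    // opening one per (map task x reduce task).
    val conf = new org.apache.spark.SparkConf()
      .set("spark.shuffle.consolidateFiles", "true")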
On 7/30/14, 4:01 PM, Jianshi Huang wrote:
I'm using Spark 1.0.1 on Yarn
shuffle file numbers, but the
number of concurrently opened files is the same as with basic hash-based shuffle.
Thanks
Jerry
*From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
*Sent:* Thursday, July 31, 2014 10:34 AM
*To:* user@spark.apache.org
*Cc:* xia...@sjtu.edu.cn
*Subject:* Re
in my
select clause.
I made the duplication on purpose so that my code parses correctly. I think
we should allow users to specify duplicated columns as return values.
--
Jianshi Huang
--
Jianshi Huang
To make my shell experience merrier, I need to import several packages, and
define implicit sparkContext and sqlContext.
Is there a startup file (e.g. ~/.sparkrc) that the Spark shell loads when
it starts?
Cheers,
--
Jianshi Huang
Hi,
How can I list all registered tables in a sql context?
--
Jianshi Huang
I see. Thanks Prashant!
Jianshi
On Wed, Sep 3, 2014 at 7:05 PM, Prashant Sharma scrapco...@gmail.com
wrote:
Hey,
You can use spark-shell -i sparkrc to do this.
Prashant Sharma
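For reference, a sketch of what such an init script might contain; the
contents below are assumed from the use case described in this thread
(imports plus an implicit SQLContext):

    // sparkrc -- executed line by line via: spark-shell -i sparkrc
    import org.apache.spark.SparkContext._
    @transient val sqlc = new org.apache.spark.sql.SQLContext(sc)
    implicit def sqlContext = sqlc
    import sqlc._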
On Wed, Sep 3, 2014 at 2:17 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
To make my shell experience
Err... there's no such feature?
Jianshi
On Wed, Sep 3, 2014 at 7:03 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
How can I list all registered tables in a sql context?
--
Jianshi Huang
Thanks Tobias,
I also found this: https://issues.apache.org/jira/browse/SPARK-3299
Looks like it's being worked on.
Jianshi
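For reference, SPARK-3299 eventually became a public API; a sketch, assuming
Spark 1.3 or later:

    // Lists the tables registered in this SQLContext.
    sqlContext.tableNames().foreach(println)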
On Mon, Sep 8, 2014 at 9:28 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Sat, Sep 6, 2014 at 1:40 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Err
)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
--
Jianshi Huang
)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
--
Jianshi
. at com.paypal.risk.rds.dragon.storage.hbase.HbaseRDDBatch$$
anonfun$batchInsertEdges$3.apply(HbaseRDDBatch.scala:179)
Can you reveal what HbaseRDDBatch.scala does?
Cheers
On Wed, Sep 24, 2014 at 8:46 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
One of my big Spark programs always gets stuck at 99%, where a few tasks
never
in the dark: have you checked the region server logs to see if the
region server had trouble keeping up?
Cheers
On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi Ted,
It converts RDD[Edge] to HBase rowkeys and columns and inserts them into
HBase (in batches).
BTW, I found
regionserver needs to be balanced... you might have some skewness in
row keys and one regionserver is under pressure... try finding that key and
replicating it using a random salt
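A sketch of the salting idea, assuming string row keys; the bucket count is
arbitrary:

    // Prefix skewed row keys with a random bucket so writes spread across
    // region servers. Note reads must then fan out over all buckets.
    val buckets = 16
    def saltedKey(rowkey: String): String = {
      val b = scala.util.Random.nextInt(buckets)
      f"$b%02d-$rowkey"
    }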
On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi Ted,
It converts RDD[Edge
finding that key
and replicating it using a random salt
On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi Ted,
It converts RDD[Edge] to HBase rowkeys and columns and inserts them into
HBase (in batches).
BTW, I found batched Puts actually faster than generating HFiles
Looks like it's an HDFS issue, pretty new.
https://issues.apache.org/jira/browse/HDFS-6999
Jianshi
On Thu, Sep 25, 2014 at 12:10 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi Ted,
See my previous reply to Debasish, all region servers are idle. I don't
think it's caused by hotspotting
is in hadoop 2.6.0
Any chance of deploying 2.6.0-SNAPSHOT to see if the problem goes away?
On Wed, Sep 24, 2014 at 10:54 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Looks like it's an HDFS issue, pretty new.
https://issues.apache.org/jira/browse/HDFS-6999
Jianshi
On Thu, Sep 25
I cannot find it in the documentation. And I have a dozen dimension tables
to (left) join...
Cheers,
--
Jianshi Huang
yuzhih...@gmail.com wrote:
Have you looked at SPARK-1800?
e.g. see sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
Cheers
On Sun, Sep 28, 2014 at 1:55 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
I cannot find it in the documentation. And I have a dozen dimension
tables
, 2014 at 1:24 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Yes, looks like it can only be controlled by the
parameter spark.sql.autoBroadcastJoinThreshold, which is a little bit weird
to me.
How am I supposed to know the exact size of a table in bytes? Letting me specify the
join algorithm is preferred, I
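For reference, a sketch of the knob under discussion, assuming Spark 1.x SQL;
the 100 MB value is made up:

    // Tables whose size in bytes is below the threshold are broadcast in joins.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
      (100 * 1024 * 1024).toString)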
, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not merged
into master?
I cannot find spark.sql.hints.broadcastTables in latest master, but it's
in the following patch.
https://github.com/apache/spark/commit
sql(ddl)
setConf("spark.sql.hive.convertMetastoreParquet", "true")
}
You'll also need to run this to populate the statistics:
ANALYZE TABLE tableName COMPUTE STATISTICS noscan;
On Wed, Oct 8, 2014 at 1:44 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Ok, currently there's cost-based
dim tables (using
HiveContext) and then map it to my class object. It failed a couple of
times, and now that I've cached the intermediate table it seems to be
working fine... no idea why until I found SPARK-3106
Cheers,
--
Jianshi Huang
Hmm... it failed again, just lasted a little bit longer.
Jianshi
On Mon, Oct 13, 2014 at 4:15 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
https://issues.apache.org/jira/browse/SPARK-3106
I'm having the same errors described in SPARK-3106 (no other types of
errors confirmed), running
Turned out it was caused by this issue:
https://issues.apache.org/jira/browse/SPARK-3923
Setting spark.akka.heartbeat.interval to 100 solved it.
Jianshi
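A sketch of applying that SPARK-3923 workaround when building the SparkConf:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.akka.heartbeat.interval", "100")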
On Mon, Oct 13, 2014 at 4:24 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hmm... it failed again, just lasted a little bit longer.
Jianshi
On Tue, Oct 14, 2014 at 4:36 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Turned out it was caused by this issue:
https://issues.apache.org/jira/browse/SPARK-3923
Setting spark.akka.heartbeat.interval to 100 solved it.
Jianshi
On Mon, Oct 13, 2014 at 4:24 PM, Jianshi Huang jianshi.hu
program from time to time.
Is there a mechanism by which Spark Streaming can load and plug in code at
runtime without restarting?
Any solutions or suggestions?
Thanks,
--
Jianshi Huang
The Kafka stream has 10 topics and the data rate is quite high (~ 100K/s
per topic).
Which configuration do you recommend?
- 1 Spark app consuming all Kafka topics
- 10 separate Spark apps, each consuming one topic
Assuming they have the same resource pool.
Cheers,
--
Jianshi Huang
--
Jianshi Huang
)
}
}
}
--
Jianshi Huang
?
--
Jianshi Huang
at
it.
For your case, I think TD's comments are quite meaningful; it's not trivial
to do so and often requires a job to scan all the records. It's also not the
design purpose of Spark Streaming, so I guess it's hard to achieve what you
want.
Thanks
Jerry
*From:* Jianshi Huang [mailto:jianshi.hu
to arrange data, but
you cannot avoid scanning the whole dataset. Basically we need to avoid
fetching large amounts of data back to the driver.
Thanks
Jerry
*From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
*Sent:* Monday, October 27, 2014 2:39 PM
*To:* Shao, Saisai
*Cc:* user
PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
You're absolutely right, it's not 'scalable' as I'm using collect().
However, it's important to have the RDDs ordered by the timestamp of the
time window (groupBy puts data into the corresponding time window).
It's fairly easy to do in Pig, but somehow
nested RDD in closure.
Thanks
Jerry
*From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
*Sent:* Monday, October 27, 2014 3:30 PM
*To:* Shao, Saisai
*Cc:* user@spark.apache.org; Tathagata Das (t...@databricks.com)
*Subject:* Re: RDD to DStream
Ok, back to Scala code, I'm wondering
HiveContext is a subclass, so we should keep the same
semantics as the default. Makes sense?
Spark is very much functional and shared-nothing; these are wonderful
features. Let's not have something global as a dependency.
Cheers,
--
Jianshi Huang
On Mon, Oct 27, 2014 at 4:44 PM, Shao, Saisai saisai.s...@intel.com wrote:
Yes, I understand what you want, but it may be hard to achieve without
collecting back to the driver node.
Besides, can we think of another way to do it?
Thanks
Jerry
*From:* Jianshi Huang
Any suggestion? :)
Jianshi
On Thu, Oct 23, 2014 at 3:49 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
The Kafka stream has 10 topics and the data rate is quite high (~ 100K/s
per topic).
Which configuration do you recommend?
- 1 Spark app consuming all Kafka topics
- 10 separate Spark
can use an in-memory Derby
database as metastore
https://db.apache.org/derby/docs/10.7/devguide/cdevdvlpinmemdb.html
I'll investigate this when I'm free; I guess we can use this for Spark SQL Hive
support testing.
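A sketch of the connection string that approach would involve; the property is
the standard Hive metastore JDBC setting (in hive-site.xml), and the database
name here is arbitrary:

    javax.jdo.option.ConnectionURL = jdbc:derby:memory:metastore_db;create=true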
On 10/27/14 4:38 PM, Jianshi Huang wrote:
There's an annoying small usability
has any idea what went wrong? Need help!
--
Jianshi Huang
<akka.version>2.3.4-spark</akka.version>
it should solve the problem. Makes sense? I'll give it a shot when I have time;
for now I'll probably just not use the Spray client...
Cheers,
Jianshi
On Tue, Oct 28, 2014 at 6:02 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
I got the following
I'm using Spark built from HEAD; I think it uses a modified Akka 2.3.4, right?
Jianshi
On Wed, Oct 29, 2014 at 5:53 AM, Mohammed Guller moham...@glassbeam.com
wrote:
Try a version built with Akka 2.2.x
Mohammed
*From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
*Sent:* Tuesday
that.
Can you try a Spray version built with 2.2.x along with Spark 1.1 and
include the Akka dependencies in your project’s sbt file?
Mohammed
*From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
*Sent:* Tuesday, October 28, 2014 8:58 PM
*To:* Mohammed Guller
*Cc:* user
*Subject:* Re
-version suffixes in:
org.scalamacros:quasiquotes
On Thu, Oct 30, 2014 at 9:50 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi Prashant, Chester, Mohammed,
I switched to Spark's Akka and now it works well. Thanks for the help!
(Need to exclude Akka from Spray dependencies, or specify
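For reference, a sketch of that kind of exclusion in sbt; the module and
versions are hypothetical:

    libraryDependencies += "io.spray" %% "spray-client" % "1.3.1" excludeAll (
      ExclusionRule(organization = "com.typesafe.akka")
    )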
needs to be
collected to the driver; is there a way to avoid doing this?
Thanks
Jianshi
On Mon, Oct 27, 2014 at 4:57 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Sure, let's still focus on the streaming simulation use case. It's a very
useful problem to solve.
If we're going to use the same
-mapreduce-to-apache-spark/
On 11/14/14 10:44 AM, Dai, Kevin wrote:
Hi all,
Is there a setup and cleanup function in Spark, as in Hadoop MapReduce, which
does some initialization and cleanup work?
Best Regards,
Kevin.
--
Jianshi Huang
@gmail.com wrote:
If you're just relying on the side effects of setup() and cleanup(), then
I think this trick is OK and pretty clean.
But if setup() returns, say, a DB connection, then the map(...) part and
cleanup() can’t get the connection object.
On 11/14/14 1:20 PM, Jianshi Huang wrote:
So
: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler
mirror not found. - [Help 1]
Does anyone know what the problem is?
I'm building it on OSX. I didn't have this problem one month ago.
--
Jianshi Huang
, Nov 14, 2014 at 2:49 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Ok, then we need another trick.
Let's have an *implicit lazy val connection/context* around our code.
And setup() will trigger the eval and initialization.
Due to lazy evaluation, I think having setup/teardown is a bit
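For completeness, a sketch of the mapPartitions idiom usually offered as
Spark's substitute for MapReduce's setup()/cleanup(); DB.connect and process
are hypothetical helpers:

    rdd.mapPartitions { iter =>
      val conn = DB.connect()                          // setup, once per partition
      // Materialize before closing: closing the connection while the
      // iterator is still lazily consumed would be a bug.
      val result = iter.map(row => process(conn, row)).toList
      conn.close()                                     // cleanup
      result.iterator
    }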
Any notable issues for using Scala 2.11? Is it stable now?
Or can I use Scala 2.11 in my Spark application with a Spark dist built
with 2.10?
I'm looking forward to migrating to 2.11 for some quasiquote features.
Couldn't make it run in 2.10...
Cheers,
--
Jianshi Huang
the build instructions here:
https://github.com/ScrapCodes/spark-1/blob/patch-3/docs/building-spark.md
Prashant Sharma
On Tue, Nov 18, 2014 at 12:19 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Any notable issues for using Scala 2.11? Is it stable now?
Or can I use Scala 2.11 in my
Hi,
I got an error during rdd.registerTempTable(...) saying scala.MatchError:
scala.BigInt
Looks like BigInt cannot be used in SchemaRDD, is that correct?
So what would you recommend to deal with it?
Thanks,
--
Jianshi Huang
@gmail.com wrote:
Hello Jianshi,
The reason for that error is that we do not have a Spark SQL data type for
Scala BigInt. You can use Decimal for your case.
Thanks,
Yin
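A sketch of that workaround, assuming the Spark 1.1-era API; the case class,
field and input names are made up:

    // Use BigDecimal (mapped to DecimalType) instead of BigInt.
    case class Rec(amount: BigDecimal)
    val fixed = bigInts.map(x => Rec(BigDecimal(x)))   // bigInts: RDD[BigInt]
    import sqlContext.createSchemaRDD                  // implicit RDD -> SchemaRDD
    fixed.registerTempTable("recs")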
On Fri, Nov 21, 2014 at 5:11 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
I got an error during
)
at
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327)
Using the same DDL and Analyze script above.
Jianshi
On Sat, Oct 11, 2014 at 2:18 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
It works fine, thanks for the help Michael.
Liancheng
/usr/lib/hive/lib doesn’t show any of the parquet
jars, but ls /usr/lib/impala/lib shows the jar we’re looking for as
parquet-hive-1.0.jar
Is it removed from latest Spark?
Jianshi
On Wed, Nov 26, 2014 at 2:13 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
Looks like the latest SparkSQL
similar situation?
--
Jianshi Huang
/3270 should be
another optimization for this.
*From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
*Sent:* Wednesday, November 26, 2014 4:36 PM
*To:* user
*Subject:* Auto BroadcastJoin optimization failed in latest Spark
Hi,
I've confirmed that the latest Spark with either Hive
using the latest Spark built from master HEAD yesterday. Is this a bug?
--
Jianshi Huang
?
Jianshi
On Fri, Dec 5, 2014 at 11:37 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
I got the following error during Spark startup (Yarn-client mode):
14/12/04 19:33:58 INFO Client: Uploading resource
file:/x/home/jianshuang/spark/spark-latest/lib/datanucleus-api-jdo-3.2.6.jar
-
hdfs
Actually my HADOOP_CLASSPATH has already been set to include
/etc/hadoop/conf/*
export
HADOOP_CLASSPATH=/etc/hbase/conf/hbase-site.xml:/usr/lib/hbase/lib/hbase-protocol.jar:$(hbase
classpath)
Jianshi
On Fri, Dec 5, 2014 at 11:54 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Looks like
Looks like the datanucleus*.jar shouldn't appear in the hdfs path in
Yarn-client mode.
Maybe this patch broke yarn-client.
https://github.com/apache/spark/commit/a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53
Jianshi
On Fri, Dec 5, 2014 at 12:02 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote
Correction:
According to Liancheng, this hotfix might be the root cause:
https://github.com/apache/spark/commit/38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce
Jianshi
On Fri, Dec 5, 2014 at 12:45 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Looks like the datanucleus*.jar shouldn't appear
-most
among the inner joins;
DESC EXTENDED tablename; -- this will print the detailed information for the
table-size statistic (the field “totalSize”)
EXPLAIN EXTENDED query; -- this will print the detailed physical plan.
Let me know if you still have problems.
Hao
*From:* Jianshi Huang
, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Sorry for the late follow-up.
I used Hao's DESC EXTENDED command and found some clues:
new (broadcast broken Spark build):
parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892,
COLUMN_STATS_ACCURATE=false, totalSize=0
With Liancheng's suggestion, I've tried setting
spark.sql.hive.convertMetastoreParquet false
but analyze noscan still returns -1 in rawDataSize
Jianshi
On Fri, Dec 5, 2014 at 3:33 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
If I run ANALYZE without NOSCAN, then Hive can successfully
fine for me on master. Note that Hive does print an
exception in the logs, but that exception does not propagate to user code.
On Thu, Dec 4, 2014 at 11:31 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
I got an exception saying Hive: NoSuchObjectException(message:table table
table pmt (
sorted::id bigint
)
stored as parquet
location '...'
Obviously it didn't work, I also tried removing the identifier sorted::,
but the resulting rows contain only nulls.
Any idea how to create a table in HiveContext from these Parquet files?
Thanks,
Jianshi
--
Jianshi Huang
(t.schema.fields.map(s => s.copy(name =
s.name.replaceAll(".*?::", ""))))
sql(s"drop table $name")
applySchema(t, newSchema).registerTempTable(name)
I'm testing it for now.
Thanks for the help!
Jianshi
On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
I had
Very interesting: the line doing drop table throws an exception. After
removing it, all works.
Jianshi
On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Here's the solution I got after talking with Liancheng:
1) using backquote `..` to wrap up all illegal
Hmm... another issue I found with this approach is that ANALYZE TABLE ...
COMPUTE STATISTICS will fail to attach the metadata to the table, and later
broadcast join and such will fail...
Any idea how to fix this issue?
Jianshi
On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang jianshi.hu
sql("select cre_ts from pmt limit 1").collect
res16: Array[org.apache.spark.sql.Row] = Array([null])
I created a JIRA for it:
https://issues.apache.org/jira/browse/SPARK-4781
Jianshi
On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hmm... another issue I found
Hmm..
I've created a JIRA: https://issues.apache.org/jira/browse/SPARK-4782
Jianshi
On Sun, Dec 7, 2014 at 2:32 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
What's the best way to convert RDD[Map[String, Any]] to a SchemaRDD?
I'm currently converting each Map to a JSON String
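For reference, a sketch of that JSON route, assuming Spark 1.x and json4s on
the classpath:

    import org.json4s.DefaultFormats
    import org.json4s.jackson.Serialization

    implicit val formats = DefaultFormats
    // Serialize each Map to a JSON string, then let jsonRDD infer the schema.
    val jsonStrings = maps.map(m => Serialization.write(m))  // maps: RDD[Map[String, Any]]
    val schemaRdd = sqlContext.jsonRDD(jsonStrings)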
, 2014 at 8:28 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Ok, found another possible bug in Hive.
My current solution is to use ALTER TABLE CHANGE to rename the column
names.
The problem is that after renaming the column names, the values of the columns
became all NULL.
Before renaming
FYI,
The latest Hive 0.14/Parquet will have column renaming support.
Jianshi
On Wed, Dec 10, 2014 at 3:37 AM, Michael Armbrust mich...@databricks.com
wrote:
You might also try out the recently added support for views.
On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote
Hi,
Has anyone implemented the default Pig loader in Spark? (loading delimited
text files with .pig_schema)
Thanks,
--
Jianshi Huang
:
I think we made the binary protocol compatible across all versions, so you
should be fine with using any one of them. 1.2.1 is probably the best since
it is the most recent stable release.
On Tue, Feb 10, 2015 at 8:43 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
I need to use
, 1.3.0)
Thanks,
--
Jianshi Huang
: https://issues.apache.org/jira/browse/SPARK-5828
Thanks,
--
Jianshi Huang
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-6353
On Mon, Mar 16, 2015 at 5:36 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
We're facing "No space left on device" errors lately from time to time.
The job will fail after retries. Obviously in such cases, retry won't
spark.scheduler.executorTaskBlacklistTime to 30000 to solve such "No
space left on device" errors. So if a task runs unsuccessfully on some
executor, it won't be scheduled to the same executor again for 30 seconds.
Best Regards,
Shixiong Zhu
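A sketch of that setting (the value is in milliseconds, matching the 30
seconds described above), assuming Spark 1.x:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.scheduler.executorTaskBlacklistTime", "30000")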
2015-03-16 17:40 GMT+08:00 Jianshi Huang jianshi.hu...@gmail.com:
I created a JIRA
Oh, by default it's set to 0L.
I'll try setting it to 30000 immediately. Thanks for the help!
Jianshi
On Mon, Mar 16, 2015 at 11:32 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Thanks Shixiong!
Very strange that our tasks were retried on the same executor again and
again. I'll check
:
We don't support expressions or wildcards in that configuration. For
each application, the local directories need to be constant. If you
have users submitting different Spark applications, those can each set
spark.local.dirs.
- Patrick
On Wed, Mar 11, 2015 at 12:14 AM, Jianshi Huang
directories either. Typically, as in YARN, you would have a number of
directories (on different disks) mounted and configured for local
storage for jobs.
On Wed, Mar 11, 2015 at 7:42 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Unfortunately the /tmp mount is really small in our environment. I
Forget about my last message. I was confused. Spark 1.2.1 + Scala 2.10.4
started by SBT's console command also failed with this error. However, running
from a standard spark-shell works.
Jianshi
On Fri, Mar 13, 2015 at 2:46 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hmm... looks like
Hmm... looks like the console command still starts Spark 1.3.0 with Scala
2.11.6 even after I changed them in build.sbt.
So the test with 1.2.1 is not valid.
Jianshi
On Fri, Mar 13, 2015 at 2:34 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
I've confirmed it only failed in console started
[info] try { f } finally { println("Elapsed: " + (now - start)/1000.0 + "s") }
[info] }
[info]
[info] @transient val sqlc = new org.apache.spark.sql.SQLContext(sc)
[info] implicit def sqlContext = sqlc
[info] import sqlc._
Jianshi
On Fri, Mar 13, 2015 at 3:10 AM, Jianshi Huang jianshi.hu...@gmail.com
Liancheng also found out that the Spark jars are not included in the
classpath of URLClassLoader.
Hmm... we're very close to the truth now.
Jianshi
On Fri, Mar 13, 2015 at 6:03 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
I'm almost certain the problem is the ClassLoader.
So adding