Need help. Spark + Accumulo = Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-16 Thread Jianshi Huang
/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA I googled it and looks like I need to exclude some JARs. Anyone has done that? Your help is really appreciated. Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http

Re: Need help. Spark + Accumulo = Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-16 Thread Jianshi Huang
...@sigmoidanalytics.com wrote: Hi Check in your driver program's Environment, (eg: http://192.168.1.39:4040/environment/). If you don't see this commons-codec-1.7.jar then that's the issue. Thanks Best Regards On Mon, Jun 16, 2014 at 5:07 PM, Jianshi Huang jianshi.hu...@gmail.com wrote
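The usual cause of this NoSuchMethodError is an older commons-codec (pre-1.4, which lacks encodeBase64String) arriving transitively and shadowing the 1.7 jar the driver expects. A minimal sbt sketch of the exclusion approach discussed here, assuming Accumulo is the dependency that drags the old codec in; the artifact and versions are illustrative only:

```scala
// build.sbt -- sketch only; adjust whichever dependency actually pulls in the old codec
libraryDependencies ++= Seq(
  "org.apache.accumulo" % "accumulo-core" % "1.5.1"
    exclude("commons-codec", "commons-codec"),   // drop the transitive, older commons-codec
  "commons-codec" % "commons-codec" % "1.7"      // pin the version seen on the Environment page
)
```

The Environment tab check suggested in the reply is the quickest way to confirm which codec jar actually made it onto the classpath.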

Yarn-client mode and standalone-client mode hang during job start

2014-06-17 Thread Jianshi Huang
-time-2.3.jar kryo-2.21.jar libthrift.jar quasiquotes_2.10-2.0.0-M8.jar scala-async_2.10-0.9.1.jar scala-library-2.10.4.jar scala-reflect-2.10.4.jar Anyone has hint what went wrong? Really confused. Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http

Re: Yarn-client mode and standalone-client mode hang during job start

2014-06-17 Thread Jianshi Huang
at same address: akka.tcp://sparkwor...@lvshdc5dn0321.lvs.paypal.com:41987 Is that a bug? Jianshi On Tue, Jun 17, 2014 at 5:41 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I'm stuck using either yarn-client or standalone-client mode. Either will get stuck when I submit jobs, the last

Re: Yarn-client mode and standalone-client mode hang during job start

2014-06-17 Thread Jianshi Huang
of it? If the latter, could you try running it from within the cluster and see if it works? (Does your rtgraph.jar exist on the machine from which you run spark-submit?) 2014-06-17 2:41 GMT-07:00 Jianshi Huang jianshi.hu...@gmail.com: Hi, I'm stuck using either yarn-client or standalone-client mode

Wildcard support in input path

2014-06-17 Thread Jianshi Huang
It would be convenient if Spark's textFile, parquetFile, etc. can support path with wildcard, such as: hdfs://domain/user/jianshuang/data/parquet/table/month=2014* Or is there already a way to do it now? Jianshi -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http

Re: Wildcard support in input path

2014-06-18 Thread Jianshi Huang
was like this b = sc.textFile(hdfs:///path to file/data_file_2013SEP01*) Thanks Regards, Meethu M On Wednesday, 18 June 2014 9:29 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: It would be convenient if Spark's textFile, parquetFile, etc. can support path with wildcard

Re: Wildcard support in input path

2014-06-18 Thread Jianshi Huang
Hi all, Thanks for the reply. I'm using parquetFile as input, is that a problem? In hadoop fs -ls, the path (hdfs://domain/user/jianshuang/data/parquet/table/month=2014*) will list all the files. I'll test it again. Jianshi On Wed, Jun 18, 2014 at 2:23 PM, Jianshi Huang jianshi.hu

Re: Wildcard support in input path

2014-06-18 Thread Jianshi Huang
of their name? On Wed, Jun 18, 2014 at 2:25 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi all, Thanks for the reply. I'm using parquetFile as input, is that a problem? In hadoop fs -ls, the path (hdfs://domain/user/jianshuang/data/parquet/table/month=2014*) will list all the files
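For reference, a small sketch of the two cases discussed in this thread: sc.textFile accepts glob patterns directly (Hadoop's FileInputFormat expands them), while for parquetFile on older Spark versions one can expand the glob manually and union the results. The path is the one from the question; it assumes a SQLContext named sqlContext is in scope, as in the shell:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val pattern = "hdfs://domain/user/jianshuang/data/parquet/table/month=2014*"

// Glob patterns work out of the box for text input
val text = sc.textFile(pattern)

// If parquetFile does not expand the glob in your Spark version, expand it yourself
// with the Hadoop FileSystem API and union the per-directory SchemaRDDs (sketch, untested)
val fs = new Path(pattern).getFileSystem(sc.hadoopConfiguration)
val dirs = fs.globStatus(new Path(pattern)).map(_.getPath.toString)
val parquet = dirs.map(sqlContext.parquetFile).reduce(_ unionAll _)
```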

Use Spark with HBase' HFileOutputFormat

2014-07-16 Thread Jianshi Huang
or PutSortReducer) But in Spark, it seems I have to do the sorting and partition myself, right? Can anyone show me how to do it properly? Is there a better way to ingest data fast to HBase from Spark? Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
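A rough sketch of the bulk-load pattern being asked about: HFileOutputFormat expects its input sorted by row key, so the work TotalOrderPartitioner/PutSortReducer does in MapReduce has to be reproduced with an RDD sort before writing. Everything here (table name, output path, how the KeyValue pairs are produced) is illustrative, not a tested recipe:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD

val hbaseConf = HBaseConfiguration.create()
val table = new HTable(hbaseConf, "my_table")              // hypothetical table name
val job = Job.getInstance(hbaseConf)
HFileOutputFormat.configureIncrementalLoad(job, table)     // sets compression, bloom filters, etc.

val kvs: RDD[(ImmutableBytesWritable, KeyValue)] = ???     // your (rowkey, KeyValue) pairs

kvs.sortByKey()                                            // global sort by row key
   .saveAsNewAPIHadoopFile("/tmp/hfiles",
     classOf[ImmutableBytesWritable], classOf[KeyValue],
     classOf[HFileOutputFormat], job.getConfiguration)
// then run LoadIncrementalHFiles (completebulkload) to move the HFiles into the table
```

Interestingly, a later message in this archive (the HbaseRDDBatch thread below) reports that batched Puts turned out faster than generating HFiles for that particular workload.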

Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-24 Thread Jianshi Huang
$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-27 Thread Jianshi Huang
, Jul 25, 2014 at 12:24 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: I can successfully run my code in local mode using spark-submit (--master local[4]), but I got ExceptionInInitializerError errors in Yarn-client mode. Any hints what is the problem? Is it a closure serialization problem

Re: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-28 Thread Jianshi Huang
? This would be helpful. I personally like Yarn-Client mode as all the running status can be checked directly from the console. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: spark.shuffle.consolidateFiles seems not working

2014-07-30 Thread Jianshi Huang
-n to set open files limit (and other limits also) And I set -n to 10240. I see spark.shuffle.consolidateFiles helps by reusing open files. (so I don't know to what extend does it help) Hope it helps. Larry On 7/30/14, 4:01 PM, Jianshi Huang wrote: I'm using Spark 1.0.1 on Yarn

Re: spark.shuffle.consolidateFiles seems not working

2014-07-31 Thread Jianshi Huang
shuffle file numbers, but the concurrent opened file number is the same as basic hash-based shuffle. Thanks Jerry *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] *Sent:* Thursday, July 31, 2014 10:34 AM *To:* user@spark.apache.org *Cc:* xia...@sjtu.edu.cn *Subject:* Re
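For completeness, enabling consolidation is a one-line configuration; as Jerry notes above, it reduces the number of shuffle files written, but the number of files open at the same time still scales with cores times reduce partitions, so the ulimit -n increase matters either way. A minimal sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: hash-based shuffle in Spark 1.x; consolidation reuses shuffle files across map tasks
val conf = new SparkConf()
  .setAppName("shuffle-consolidation-test")
  .set("spark.shuffle.consolidateFiles", "true")
val sc = new SparkContext(conf)
```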

Spark SQL (version 1.1.0-SNAPSHOT) should allow SELECT with duplicated columns

2014-08-06 Thread Jianshi Huang
in my select clause. I made the duplication on purpose for my code to parse correctly. I think we should allow users to specify duplicated columns as return value. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Out of memory on large RDDs

2014-08-27 Thread Jianshi Huang
-on-large-RDDs-tp2533p2537.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

.sparkrc for Spark shell?

2014-09-03 Thread Jianshi Huang
To make my shell experience merrier, I need to import several packages, and define implicit sparkContext and sqlContext. Is there a startup file (e.g. ~/.sparkrc) that Spark shell will load when it's started? Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http

How to list all registered tables in a sql context?

2014-09-03 Thread Jianshi Huang
Hi, How can I list all registered tables in a sql context? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: .sparkrc for Spark shell?

2014-09-04 Thread Jianshi Huang
I see. Thanks Prashant! Jianshi On Wed, Sep 3, 2014 at 7:05 PM, Prashant Sharma scrapco...@gmail.com wrote: Hey, You can use spark-shell -i sparkrc, to do this. Prashant Sharma On Wed, Sep 3, 2014 at 2:17 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: To make my shell experience
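A sketch of what such a startup file can contain (the file name and contents are illustrative); it is loaded with `spark-shell -i ~/.sparkrc`, and `sc` is already defined by the shell at that point:

```scala
// ~/.sparkrc
import org.apache.spark.sql.SQLContext

@transient val sqlc = new SQLContext(sc)
implicit def sqlContext: SQLContext = sqlc
import sqlc._
```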

Re: How to list all registered tables in a sql context?

2014-09-05 Thread Jianshi Huang
Err... there's no such feature? Jianshi On Wed, Sep 3, 2014 at 7:03 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, How can I list all registered tables in a sql context? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com

Re: How to list all registered tables in a sql context?

2014-09-08 Thread Jianshi Huang
Thanks Tobias, I also found this: https://issues.apache.org/jira/browse/SPARK-3299 Looks like it's being worked on. Jianshi On Mon, Sep 8, 2014 at 9:28 AM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Sat, Sep 6, 2014 at 1:40 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Err
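Once SPARK-3299 landed (Spark 1.3 and later), the SQLContext exposes this directly; a short sketch:

```scala
val names: Array[String] = sqlContext.tableNames()  // names of all tables registered in this context
sqlContext.tables().show()                          // or as a DataFrame: tableName, isTemporary
```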

[no subject]

2014-09-24 Thread Jianshi Huang
) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http

Executor/Worker stuck at parquet.hadoop.ParquetFileReader.readNextRowGroup and never finishes.

2014-09-24 Thread Jianshi Huang
) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) -- Jianshi

Re:

2014-09-24 Thread Jianshi Huang
. at com.paypal.risk.rds.dragon.storage.hbase.HbaseRDDBatch$$anonfun$batchInsertEdges$3.apply(HbaseRDDBatch.scala:179) Can you reveal what HbaseRDDBatch.scala does ? Cheers On Wed, Sep 24, 2014 at 8:46 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: One of my big spark program always get stuck at 99% where a few tasks never

Re:

2014-09-24 Thread Jianshi Huang
in the dark: have you checked region server (logs) to see if region server had trouble keeping up ? Cheers On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Ted, It converts RDD[Edge] to HBase rowkey and columns and insert them to HBase (in batch). BTW, I found

Re:

2014-09-24 Thread Jianshi Huang
regionserver needs to be balanced... you might have some skewness in row keys and one regionserver is under pressure... try finding that key and replicate it using random salt On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Ted, It converts RDD[Edge

Re:

2014-09-24 Thread Jianshi Huang
finding that key and replicate it using random salt On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Ted, It converts RDD[Edge] to HBase rowkey and columns and insert them to HBase (in batch). BTW, I found batched Put actually faster than generating HFiles

Re:

2014-09-24 Thread Jianshi Huang
Looks like it's a HDFS issue, pretty new. https://issues.apache.org/jira/browse/HDFS-6999 Jianshi On Thu, Sep 25, 2014 at 12:10 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Ted, See my previous reply to Debasish, all region servers are idle. I don't think it's caused by hotspotting

Re:

2014-09-25 Thread Jianshi Huang
is in hadoop 2.6.0 Any chance of deploying 2.6.0-SNAPSHOT to see if the problem goes away ? On Wed, Sep 24, 2014 at 10:54 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Looks like it's a HDFS issue, pretty new. https://issues.apache.org/jira/browse/HDFS-6999 Jianshi On Thu, Sep 25

How to do broadcast join in SparkSQL

2014-09-28 Thread Jianshi Huang
I cannot find it in the documentation. And I have a dozen dimension tables to (left) join... Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: How to do broadcast join in SparkSQL

2014-09-28 Thread Jianshi Huang
yuzhih...@gmail.com wrote: Have you looked at SPARK-1800 ? e.g. see sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala Cheers On Sun, Sep 28, 2014 at 1:55 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: I cannot find it in the documentation. And I have a dozen dimension tables

Re: How to do broadcast join in SparkSQL

2014-10-08 Thread Jianshi Huang
, 2014 at 1:24 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Yes, looks like it can only be controlled by the parameter spark.sql.autoBroadcastJoinThreshold, which is a little bit weird to me. How am I suppose to know the exact bytes of a table? Let me specify the join algorithm is preferred I

Re: How to do broadcast join in SparkSQL

2014-10-08 Thread Jianshi Huang
, Jianshi Huang jianshi.hu...@gmail.com wrote: Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not merged into master? I cannot find spark.sql.hints.broadcastTables in latest master, but it's in the following patch. https://github.com/apache/spark/commit

Re: How to do broadcast join in SparkSQL

2014-10-11 Thread Jianshi Huang
sql(ddl) setConf(spark.sql.hive.convertMetastoreParquet, true) } You'll also need to run this to populate the statistics: ANALYZE TABLE tableName COMPUTE STATISTICS noscan; On Wed, Oct 8, 2014 at 1:44 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Ok, currently there's cost-based
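Putting the pieces from this thread together, a sketch of driving the broadcast decision through table statistics rather than an explicit hint; the table names and threshold value are illustrative:

```scala
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
// Tables whose statistics report a size below this threshold (in bytes) are broadcast
hc.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

// Populate the size statistics the planner relies on (for Hive-managed tables)
hc.sql("ANALYZE TABLE dim_country COMPUTE STATISTICS noscan")

val joined = hc.sql(
  "SELECT f.*, d.country_name FROM fact_events f JOIN dim_country d ON f.country_id = d.id")
```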

SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
dim tables (using HiveContext) and then map it to my class object. It failed a couple of times and now I cached the intermediate table and currently it seems working fine... no idea why until I found SPARK-3106 Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
Hmm... it failed again, just lasted a little bit longer. Jianshi On Mon, Oct 13, 2014 at 4:15 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: https://issues.apache.org/jira/browse/SPARK-3106 I'm having the saming errors described in SPARK-3106 (no other types of errors confirmed), running

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
Turned out it was caused by this issue: https://issues.apache.org/jira/browse/SPARK-3923 Set spark.akka.heartbeat.interval to 100 solved it. Jianshi On Mon, Oct 13, 2014 at 4:24 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hmm... it failed again, just lasted a little bit longer. Jianshi

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
On Tue, Oct 14, 2014 at 4:36 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Turned out it was caused by this issue: https://issues.apache.org/jira/browse/SPARK-3923 Set spark.akka.heartbeat.interval to 100 solved it. Jianshi On Mon, Oct 13, 2014 at 4:24 PM, Jianshi Huang jianshi.hu

Dynamically loaded Spark-stream consumer

2014-10-23 Thread Jianshi Huang
program from time to time. Is there a mechanism that Spark stream can load and plugin code in runtime without restarting? Any solutions or suggestions? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Which is better? One spark app listening to 10 topics vs. 10 spark apps each listening to 1 topic

2014-10-23 Thread Jianshi Huang
The Kafka stream has 10 topics and the data rate is quite high (~ 100K/s per topic). Which configuration do you recommend? - 1 Spark app consuming all Kafka topics - 10 separate Spark app each consuming one topic Assuming they have the same resource pool. Cheers, -- Jianshi Huang LinkedIn

Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Jianshi Huang
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: RDD to DStream

2014-10-26 Thread Jianshi Huang
) } } } -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Spark streaming update/restart gracefully

2014-10-27 Thread Jianshi Huang
? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
at it. For your case, I think TD’s comment are quite meaningful, it’s not trivial to do so, often requires a job to scan all the records, it’s also not the design purpose of Spark Streaming, I guess it’s hard to achieve what you want. Thanks Jerry *From:* Jianshi Huang [mailto:jianshi.hu

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
to arrange data, but you cannot avoid scanning the whole data. Basically we need to avoid fetching large amount of data back to driver. Thanks Jerry *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] *Sent:* Monday, October 27, 2014 2:39 PM *To:* Shao, Saisai *Cc:* user

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
PM, Jianshi Huang jianshi.hu...@gmail.com wrote: You're absolutely right, it's not 'scalable' as I'm using collect(). However, it's important to have the RDDs ordered by the timestamp of the time window (groupBy puts data to corresponding timewindow). It's fairly easy to do in Pig, but somehow

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
nested RDD in closure. Thanks Jerry *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] *Sent:* Monday, October 27, 2014 3:30 PM *To:* Shao, Saisai *Cc:* user@spark.apache.org; Tathagata Das (t...@databricks.com) *Subject:* Re: RDD to DStream Ok, back to Scala code, I'm wondering

Ephemeral Hive metastore for HiveContext?

2014-10-27 Thread Jianshi Huang
HiveContext is a subclass, we should make the same semantics as default. Make sense? Spark is very much functional and shared nothing, these are wonderful features. Let's not have something global as a dependency. Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
On Mon, Oct 27, 2014 at 4:44 PM, Shao, Saisai saisai.s...@intel.com wrote: Yes, I understand what you want, but maybe hard to achieve without collecting back to driver node. Besides, can we just think of another way to do it. Thanks Jerry *From:* Jianshi Huang
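A sketch of the simulation idea discussed in this thread: slice a historical RDD of timestamped records into per-window RDDs and replay them through queueStream, collecting only the list of window keys (not the data) to the driver. The types and window arithmetic are illustrative; as noted above, each window still implies a scan over the source data:

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext

def replay(ssc: StreamingContext, data: RDD[(Long, String)], windowMs: Long) = {
  val byWindow = data.map { case (ts, ev) => (ts / windowMs, ev) }.cache()
  val windows  = byWindow.keys.distinct().collect().sorted   // small: one entry per time window
  val queue    = mutable.Queue(windows.map(w => byWindow.filter(_._1 == w).values): _*)
  ssc.queueStream(queue, oneAtATime = true)                  // one window's RDD per batch interval
}
```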

Re: Which is better? One spark app listening to 10 topics vs. 10 spark apps each listening to 1 topic

2014-10-27 Thread Jianshi Huang
Any suggestion? :) Jianshi On Thu, Oct 23, 2014 at 3:49 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: The Kafka stream has 10 topics and the data rate is quite high (~ 100K/s per topic). Which configuration do you recommend? - 1 Spark app consuming all Kafka topics - 10 separate Spark

Re: Ephemeral Hive metastore for HiveContext?

2014-10-27 Thread Jianshi Huang
can use an in-memory Derby database as metastore https://db.apache.org/derby/docs/10.7/devguide/cdevdvlpinmemdb.html I'll investigate this when free, guess we can use this for Spark SQL Hive support testing. On 10/27/14 4:38 PM, Jianshi Huang wrote: There's an annoying small usability
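A heavily hedged sketch of the in-memory Derby idea from the reply above (untested; whether the property is picked up this way depends on when the metastore client initializes, so hive-site.xml may be the more reliable place for it):

```scala
// Point the embedded metastore at an in-memory Derby database so nothing persists between runs
System.setProperty("javax.jdo.option.ConnectionURL",
  "jdbc:derby:memory:sparkmetastore;create=true")
val hive = new org.apache.spark.sql.hive.HiveContext(sc)
```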

Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
has idea what went wrong? Need help! -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
akka.version2.3.4-spark/akka.version it should solve problem. Makes sense? I'll give it a shot when I have time, now probably I'll just not using Spray client... Cheers, Jianshi On Tue, Oct 28, 2014 at 6:02 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I got the following

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
I'm using Spark built from HEAD, I think it uses modified Akka 2.3.4, right? Jianshi On Wed, Oct 29, 2014 at 5:53 AM, Mohammed Guller moham...@glassbeam.com wrote: Try a version built with Akka 2.2.x Mohammed *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] *Sent:* Tuesday

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-30 Thread Jianshi Huang
that. Can you try a Spray version built with 2.2.x along with Spark 1.1 and include the Akka dependencies in your project’s sbt file? Mohammed *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] *Sent:* Tuesday, October 28, 2014 8:58 PM *To:* Mohammed Guller *Cc:* user *Subject:* Re

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-11-10 Thread Jianshi Huang
-version suffixes in: org.scalamacros:quasiquotes On Thu, Oct 30, 2014 at 9:50 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Preshant, Chester, Mohammed, I switched to Spark's Akka and now it works well. Thanks for the help! (Need to exclude Akka from Spray dependencies, or specify
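A build.sbt sketch of the workaround that ended up working: keep Spark's (modified) Akka on the classpath and stop Spray from pulling in an incompatible com.typesafe.akka. The Spray and Spark versions shown are assumptions; match them to your build:

```scala
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
  "io.spray" %% "spray-client" % "1.3.2"
    excludeAll ExclusionRule(organization = "com.typesafe.akka")  // use Spark's Akka instead
)
```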

Re: RDD to DStream

2014-11-12 Thread Jianshi Huang
needs to be collect to driver, is there a way to avoid doing this? Thanks Jianshi On Mon, Oct 27, 2014 at 4:57 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Sure, let's still focus on the streaming simulation use case. It's a very useful problem to solve. If we're going to use the same

Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Jianshi Huang
-mapreduce-to-apache-spark/ On 11/14/14 10:44 AM, Dai, Kevin wrote: HI, all Is there setup and cleanup function as in hadoop mapreduce in spark which does some initialization and cleanup work? Best Regards, Kevin. ​ -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github

Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Jianshi Huang
@gmail.com wrote: If you’re just relying on the side effect of setup() and cleanup() then I think this trick is OK and pretty cleaner. But if setup() returns, say, a DB connection, then the map(...) part and cleanup() can’t get the connection object. On 11/14/14 1:20 PM, Jianshi Huang wrote: So

Compiling Spark master HEAD failed.

2014-11-14 Thread Jianshi Huang
: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found. - [Help 1] Anyone knows what's the problem? I'm building it on OSX. I didn't have this problem one month ago. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http

Re: Is there setup and cleanup function in spark?

2014-11-17 Thread Jianshi Huang
, Nov 14, 2014 at 2:49 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Ok, then we need another trick. let's have an *implicit lazy var connection/context* around our code. And setup() will trigger the eval and initialization. Due to lazy evaluation, I think having setup/teardown is a bit
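A compact sketch of the mapPartitions trick being discussed, which keeps the per-partition "setup" result visible to the map body and runs the "cleanup" afterwards. DummyConnection stands in for whatever resource is actually opened; note the iterator is materialized before the connection is closed:

```scala
class DummyConnection { def lookup(x: Int): Int = x * 2; def close(): Unit = () }

val enriched = sc.parallelize(1 to 100).mapPartitions { iter =>
  val conn = new DummyConnection()          // setup: once per partition, on the executor
  val out  = iter.map(conn.lookup).toList   // materialize before closing the resource
  conn.close()                              // cleanup
  out.iterator
}
```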

Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Jianshi Huang
Any notable issues for using Scala 2.11? Is it stable now? Or can I use Scala 2.11 in my spark application and use Spark dist build with 2.10 ? I'm looking forward to migrate to 2.11 for some quasiquote features. Couldn't make it run in 2.10... Cheers, -- Jianshi Huang LinkedIn: jianshi

Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-18 Thread Jianshi Huang
the build instructions here : https://github.com/ScrapCodes/spark-1/blob/patch-3/docs/building-spark.md Prashant Sharma On Tue, Nov 18, 2014 at 12:19 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Any notable issues for using Scala 2.11? Is it stable now? Or can I use Scala 2.11 in my

How to deal with BigInt in my case class for RDD = SchemaRDD convertion

2014-11-21 Thread Jianshi Huang
Hi, I got an error during rdd.registerTempTable(...) saying scala.MatchError: scala.BigInt Looks like BigInt cannot be used in SchemaRDD, is that correct? So what would you recommend to deal with it? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http

Re: How to deal with BigInt in my case class for RDD = SchemaRDD convertion

2014-11-21 Thread Jianshi Huang
@gmail.com wrote: Hello Jianshi, The reason of that error is that we do not have a Spark SQL data type for Scala BigInt. You can use Decimal for your case. Thanks, Yin On Fri, Nov 21, 2014 at 5:11 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I got an error during
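A minimal illustration of the workaround suggested in the reply: scala.BigInt has no Spark SQL type (hence the MatchError), but BigDecimal maps to DecimalType, so the case class used for schema inference can carry the value as BigDecimal (or a plain Long if it fits). The class and table names are made up, and a SQLContext named sqlContext is assumed to be in scope:

```scala
case class Account(id: Long, balance: BigDecimal)

import sqlContext.createSchemaRDD    // implicit RDD -> SchemaRDD conversion in Spark 1.x
val rows = sc.parallelize(Seq(Account(1L, BigDecimal("12345678901234567890"))))
rows.registerTempTable("accounts")
```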

Re: How to do broadcast join in SparkSQL

2014-11-25 Thread Jianshi Huang
) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327) Using the same DDL and Analyze script above. Jianshi On Sat, Oct 11, 2014 at 2:18 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: It works fine, thanks for the help Michael. Liancheng

Re: How to do broadcast join in SparkSQL

2014-11-25 Thread Jianshi Huang
/usr/lib/hive/lib doesn’t show any of the parquet jars, but ls /usr/lib/impala/lib shows the jar we’re looking for as parquet-hive-1.0.jar Is it removed from latest Spark? Jianshi On Wed, Nov 26, 2014 at 2:13 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, Looks like the latest SparkSQL

Auto BroadcastJoin optimization failed in latest Spark

2014-11-26 Thread Jianshi Huang
similar situation? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-11-27 Thread Jianshi Huang
/3270 should be another optimization for this. *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] *Sent:* Wednesday, November 26, 2014 4:36 PM *To:* user *Subject:* Auto BroadcastJoin optimization failed in latest Spark Hi, I've confirmed that the latest Spark with either Hive

Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
using latest Spark built from master HEAD yesterday. Is this a bug? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
? Jianshi On Fri, Dec 5, 2014 at 11:37 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: I got the following error during Spark startup (Yarn-client mode): 14/12/04 19:33:58 INFO Client: Uploading resource file:/x/home/jianshuang/spark/spark-latest/lib/datanucleus-api-jdo-3.2.6.jar - hdfs

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Actually my HADOOP_CLASSPATH has already been set to include /etc/hadoop/conf/* export HADOOP_CLASSPATH=/etc/hbase/conf/hbase-site.xml:/usr/lib/hbase/lib/hbase-protocol.jar:$(hbase classpath) Jianshi On Fri, Dec 5, 2014 at 11:54 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Looks like

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Looks like the datanucleus*.jar shouldn't appear in the hdfs path in Yarn-client mode. Maybe this patch broke yarn-client. https://github.com/apache/spark/commit/a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53 Jianshi On Fri, Dec 5, 2014 at 12:02 PM, Jianshi Huang jianshi.hu...@gmail.com wrote

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Correction: According to Liancheng, this hotfix might be the root cause: https://github.com/apache/spark/commit/38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce Jianshi On Fri, Dec 5, 2014 at 12:45 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Looks like the datanucleus*.jar shouldn't appear

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
-most among the inner joins; DESC EXTENDED tablename; -- this will print the detail information for the statistic table size (the field “totalSize”) EXPLAIN EXTENDED query; -- this will print the detail physical plan. Let me know if you still have problem. Hao *From:* Jianshi Huang

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
, Jianshi Huang jianshi.hu...@gmail.com wrote: Sorry for the late of follow-up. I used Hao's DESC EXTENDED command and found some clue: new (broadcast broken Spark build): parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
With Liancheng's suggestion, I've tried setting spark.sql.hive.convertMetastoreParquet false but still analyze noscan return -1 in rawDataSize Jianshi On Fri, Dec 5, 2014 at 3:33 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: If I run ANALYZE without NOSCAN, then Hive can successfully

Re: drop table if exists throws exception

2014-12-05 Thread Jianshi Huang
fine for me on master. Note that Hive does print an exception in the logs, but that exception does not propogate to user code. On Thu, Dec 4, 2014 at 11:31 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I got exception saying Hive: NoSuchObjectException(message:table table

Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-05 Thread Jianshi Huang
table pmt ( sorted::id bigint ) stored as parquet location '...' Obviously it didn't work, I also tried removing the identifier sorted::, but the resulting rows contain only nulls. Any idea how to create a table in HiveContext from these Parquet files? Thanks, Jianshi -- Jianshi Huang

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-05 Thread Jianshi Huang
(t.schema.fields.map(s = s.copy(name = s.name.replaceAll(.*?::, sql(sdrop table $name) applySchema(t, newSchema).registerTempTable(name) I'm testing it for now. Thanks for the help! Jianshi On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I had

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
Very interesting, the line doing drop table will throws an exception. After removing it all works. Jianshi On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Here's the solution I got after talking with Liancheng: 1) using backquote `..` to wrap up all illegal
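Reconstructed from the (quote-stripped) snippet above, the working part of the fix amounts to rewriting the field names of the inferred schema and re-registering the table, without the drop-table line that turned out to throw. A sketch, assuming a HiveContext or SQLContext named sqlContext:

```scala
import org.apache.spark.sql.SchemaRDD

def stripPigPrefixes(t: SchemaRDD, name: String): Unit = {
  // "sorted::id" -> "id", "bag::col1" -> "col1", etc.
  val newSchema = t.schema.copy(
    fields = t.schema.fields.map(f => f.copy(name = f.name.replaceAll(".*?::", ""))))
  sqlContext.applySchema(t, newSchema).registerTempTable(name)
}
```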

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
Hmm... another issue I found doing this approach is that ANALYZE TABLE ... COMPUTE STATISTICS will fail to attach the metadata to the table, and later broadcast join and such will fail... Any idea how to fix this issue? Jianshi On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang jianshi.hu

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
sql(select cre_ts from pmt limit 1).collect res16: Array[org.apache.spark.sql.Row] = Array([null]) I created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-4781 Jianshi On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hmm... another issue I found

Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-06 Thread Jianshi Huang
Hmm.. I've created a JIRA: https://issues.apache.org/jira/browse/SPARK-4782 Jianshi On Sun, Dec 7, 2014 at 2:32 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, What's the best way to convert RDD[Map[String, Any]] to a SchemaRDD? I'm currently converting each Map to a JSON String
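A sketch of the JSON detour described above: serialize each Map to a JSON string and let jsonRDD infer the schema. json4s (which Spark already ships) is used here purely for illustration; any JSON library would do:

```scala
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization

val maps = sc.parallelize(Seq(Map("name" -> "a", "count" -> 1)))

val json = maps.mapPartitions { iter =>
  implicit val formats = DefaultFormats
  iter.map(m => Serialization.write(m))    // one JSON document per record
}
val schemaRdd = sqlContext.jsonRDD(json)   // schema is inferred from the JSON
```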

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-08 Thread Jianshi Huang
, 2014 at 8:28 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Ok, found another possible bug in Hive. My current solution is to use ALTER TABLE CHANGE to rename the column names. The problem is after renaming the column names, the value of the columns became all NULL. Before renaming

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-23 Thread Jianshi Huang
FYI, Latest hive 0.14/parquet will have column renaming support. Jianshi On Wed, Dec 10, 2014 at 3:37 AM, Michael Armbrust mich...@databricks.com wrote: You might also try out the recently added support for views. On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang jianshi.hu...@gmail.com wrote

Pig loader in Spark

2015-02-03 Thread Jianshi Huang
Hi, Anyone has implemented the default Pig Loader in Spark? (loading delimited text files with .pig_schema) Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-13 Thread Jianshi Huang
: I think we made the binary protocol compatible across all versions, so you should be fine with using any one of them. 1.2.1 is probably the best since it is the most recent stable release. On Tue, Feb 10, 2015 at 8:43 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I need to use

Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-10 Thread Jianshi Huang
, 1.3.0) Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Dynamic partition pattern support

2015-02-15 Thread Jianshi Huang
: https://issues.apache.org/jira/browse/SPARK-5828 Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-6353 On Mon, Mar 16, 2015 at 5:36 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, We're facing No space left on device errors lately from time to time. The job will fail after retries. Obviously, in such a case, retrying won't

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
spark.scheduler.executorTaskBlacklistTime to 30000 to solve such No space left on device errors. So if a task runs unsuccessfully in some executor, it won't be scheduled to the same executor in 30 seconds. Best Regards, Shixiong Zhu 2015-03-16 17:40 GMT+08:00 Jianshi Huang jianshi.hu...@gmail.com: I created a JIRA

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
Oh, by default it's set to 0L. I'll try setting it to 30000 immediately. Thanks for the help! Jianshi On Mon, Mar 16, 2015 at 11:32 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Thanks Shixiong! Very strange that our tasks were retried on the same executor again and again. I'll check
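For reference, the setting discussed above (value in milliseconds; an internal, undocumented knob at the time) can be set on the SparkConf or via --conf on spark-submit; a one-line sketch:

```scala
val conf = new org.apache.spark.SparkConf()
  .set("spark.scheduler.executorTaskBlacklistTime", "30000")  // don't retry on the same executor for 30s
```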

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
: We don't support expressions or wildcards in that configuration. For each application, the local directories need to be constant. If you have users submitting different Spark applications, those can each set spark.local.dirs. - Patrick On Wed, Mar 11, 2015 at 12:14 AM, Jianshi Huang

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
directories either. Typically, like in YARN, you would have a number of directories (on different disks) mounted and configured for local storage for jobs. On Wed, Mar 11, 2015 at 7:42 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Unfortunately /tmp mount is really small in our environment. I

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
Forget about my last message. I was confused. Spark 1.2.1 + Scala 2.10.4 started by SBT console command also failed with this error. However running from a standard spark shell works. Jianshi On Fri, Mar 13, 2015 at 2:46 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hmm... look like

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
Hmm... look like the console command still starts a Spark 1.3.0 with Scala 2.11.6 even I changed them in build.sbt. So the test with 1.2.1 is not valid. Jianshi On Fri, Mar 13, 2015 at 2:34 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: I've confirmed it only failed in console started

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
[info] try { f } finally { println(Elapsed: + (now - start)/1000.0 + s) } [info] } [info] [info] @transient val sqlc = new org.apache.spark.sql.SQLContext(sc) [info] implicit def sqlContext = sqlc [info] import sqlc._ Jianshi On Fri, Mar 13, 2015 at 3:10 AM, Jianshi Huang jianshi.hu...@gmail.com

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
Liancheng also found out that the Spark jars are not included in the classpath of URLClassLoader. Hmm... we're very close to the truth now. Jianshi On Fri, Mar 13, 2015 at 6:03 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: I'm almost certain the problem is the ClassLoader. So adding
