I don't think there is a performance difference between the 1.x API and the 2.x API.
It's not a big issue for your change; the only thing is in
com.databricks.hadoop.mapreduce.lib.input.XmlInputFormat.java:
use .write, not .write().
> On Dec 9, 2015, at 5:37 PM, Divya Gehlot wrote:
>
> Hi,
> I am using Spark 1.4.1.
> I am getting an error when persisting Spark DataFrame output to Hive
> scala>
>
val req_logs_with_dpid = req_logs.filter(req_logs("req_info.pid") != "" )
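A side note on the snippet above: in Spark 1.x, plain != compares the Column object itself and yields a JVM Boolean, not a Column, so the filter won't behave as intended. A minimal sketch of the fix, using the Column API's !== operator from 1.x:

val req_logs_with_dpid = req_logs.filter(req_logs("req_info.pid") !== "")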
Azuryy Yu
Sr. Infrastructure Engineer
cell: 158-0164-9103
wechat: azuryy
On Wed, Dec 9, 2015 at 7:43 PM, Prashant Bhardwaj <
prashant2006s...@gmail.com> wrote:
> Hi
>
> I have two columns in my json which can have null,
>
> <https://in.linkedin.com/in/ramkumarcs31>
>
>
> On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu <fengdo...@everstring.com
> <mailto:fengdo...@everstring.com>> wrote:
> Can you give more detail about your question? What do your previous batch and
> the current batch look like?
Can you give more detail about your question? What do your previous batch and
the current batch look like?
> On Dec 8, 2015, at 3:52 PM, Ramkumar V wrote:
>
> Hi,
>
> I'm running java over spark in cluster mode. I want to apply filter on
> javaRDD based on some previous batch
https://github.com/nerdammer/spark-hbase-connector
This one is better and easier to use.
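From memory of that project's README, usage looks roughly like the sketch below; take the method names as assumptions and check the repo before relying on them:

import it.nerdammer.spark.hbase._

// write an RDD of (rowKey, col1, col2) tuples into HBase
sc.parallelize(1 to 100)
  .map(i => (i.toString, i + 1, "Hello"))
  .toHBaseTable("mytable")
  .toColumns("column1", "column2")
  .inColumnFamily("mycf")
  .save()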
> On Dec 9, 2015, at 3:04 PM, censj wrote:
>
> hi all,
> now I am using Spark, but I have not found an open-source Spark-HBase
> connector. Can anyone tell me of one?
>
Can you try something like this in your sbt build:

val spark_version = "1.5.2"
val excludeServletApi = ExclusionRule(organization = "javax.servlet", artifact = "servlet-api")
val excludeEclipseJetty = ExclusionRule(organization = "org.eclipse.jetty")

libraryDependencies ++= Seq(
  "org.apache.spark" %%
If your RDD is in JSON format, that's easy:

val df = sqlContext.read.json(rdd)
df.write.saveAsTable("your_table_name")
> On Dec 7, 2015, at 5:28 PM, Divya Gehlot wrote:
>
> Hi,
> I am a newbie to Spark.
> Could somebody guide me on how I can persist my Spark RDD results in Hive
I suppose your output data is ORC and you want to save it to the Hive database "test"
with external table name "testTable":

sqlContext.createExternalTable("test.testTable",
  "org.apache.spark.sql.hive.orc", Map("path" -> "/data/test/mydata"))
> On Dec 7, 2015, at 5:28
Refer here:
https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html
specifically the section "Example 4-27. Python custom partitioner".
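Since you're not a Python expert, the same idea in Scala is just a Partitioner subclass. A minimal sketch along the lines of that example (class and field names are mine):

import org.apache.spark.Partitioner

// partition pair-RDD records whose keys are URLs by the URL's domain
class DomainPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    val domain = new java.net.URL(key.toString).getHost
    val code = domain.hashCode % numPartitions
    if (code < 0) code + numPartitions else code  // hashCode can be negative
  }
}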
> On Dec 8, 2015, at 10:07 AM, Keith Freeman <8fo...@gmail.com> wrote:
>
> I'm not a python expert, so I'm wondering if anybody has a
Don't do the join first. Broadcast your small RDD (collect it to the driver first),
then filter the large RDD against it. Roughly, assuming both RDDs are (key, value) pairs:

val bc = sc.broadcast(small_rdd.collectAsMap())

large_rdd
  .filter { case (k, _) => bc.value.contains(k) }
  .map { case (k, v) => (k, (bc.value(k), v)) }  // attach the small side's fields
  .distinct()
  .groupByKey()
> On Dec 7, 2015, at 1:41 PM, Z Z wrote:
>
> I have two RDDs, one really
Yes, it results in a shuffle.
> On Dec 4, 2015, at 6:04 PM, Stephen Boesch <java...@gmail.com> wrote:
>
> @Yu Fengdong: Your approach - specifically the groupBy results in a shuffle
> does it not?
>
> 2015-12-04 2:02 GMT-08:00 Fengdong Yu <fengdo...@evers
There are many ways; one simple one, say you want to know how many rows there are for each month:

sqlContext.read.parquet("……../month=*").select($"month").groupBy($"month").count()

The output looks like:

month   count
201411  100
201412  200

Hope this helps.
> On Dec 4, 2015, at 5:53 PM, Yiannis
It depends on several things:
1) What's your data format? CSV (text) or ORC/Parquet?
2) Do you have a data warehouse to summarize/cluster your data?

If your data is text, or you query the raw data, it will be slow; Spark cannot
do much to optimize that kind of job.
> On Dec 2, 2015, at 9:21
Hi,
You can try this. If your table is under the location "/test/table/" on HDFS
and has partitions:
"/test/table/dt=2012"
"/test/table/dt=2013"

df.write.mode(SaveMode.Append).partitionBy("dt").save("/test/table")

(partitionBy must name the partition column as it appears in the layout, hence "dt")
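Reading it back, Spark discovers the dt partitions automatically, so a filter on dt prunes directories. A small sketch, assuming the save above used the default Parquet format:

val df = sqlContext.read.load("/test/table")
df.filter(df("dt") === "2013").show()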
> On Dec 2, 2015, at 10:50 AM, Isabelle Phan wrote:
>
>
hiveContext.read.format("orc").load("mypath/*")
> On Nov 24, 2015, at 1:07 PM, Renu Yadav wrote:
>
> Hi ,
>
> I am using dataframe and want to load orc file using multiple directory
> like this:
> hiveContext.read.format.load("mypath/3660,myPath/3661")
>
> but it is not
It's as simple as:

val df = sqlContext.sql("select * from table")

or

val df = sqlContext.read.json("hdfs_path")
> On Nov 24, 2015, at 3:09 AM, spark_user_2015 wrote:
>
> Dear all,
>
> is the following usage of the Dataframe constructor correct or does it
> trigger any side
Hi,
I found that if a column value is too long, the spark shell only shows a partial result.
For example:

sqlContext.sql("select url from tableA").show(10)

cannot show the whole URL. How do I adjust that? Thanks.
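A follow-up sketch, assuming Spark 1.5.x, where DataFrame.show has a two-argument overload that disables truncation:

sqlContext.sql("select url from tableA").show(10, false)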
The simplest way is to remove all "provided" scopes in your build file,
then run 'sbt assembly' to build your final package, and then get rid of '--jars',
because the assembly already includes all dependencies.
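A minimal sketch of the plugin setup this assumes (sbt-assembly; the version is my guess for that era):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

Then run: sbt assembly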
> On Nov 18, 2015, at 2:15 PM, Jack Yang wrote:
>
> So weird. Is there anything wrong with
Hi,
we use 'Airflow' as our job workflow scheduler.
> On Nov 19, 2015, at 9:47 AM, Vikram Kone wrote:
>
> Hi Nick,
> Quick question about spark-submit command executed from azkaban with command
> job type.
> I see that when I press kill in azkaban portal on a
Yes, you can submit jobs remotely.
> On Nov 19, 2015, at 10:10 AM, Vikram Kone <vikramk...@gmail.com> wrote:
>
> Hi Feng,
> Does airflow allow remote submissions of spark jobs via spark-submit?
>
> On Wed, Nov 18, 2015 at 6:01 PM, Fengdong Yu <fengdo...@evers
Can you try: new PixelGenerator(startTime, endTime)?
> On Nov 16, 2015, at 12:47 PM, Zhang, Jingyu wrote:
>
> I want to pass two parameters into new java class from rdd.mapPartitions(),
> the code like following.
> ---Source Code
>
> Main method:
>
> /*the
Just make PixelGenerator a nested static class?
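In Scala terms, the equivalent is defining the class inside an object (or at the top level) so no outer instance gets captured and serialized with it. A minimal sketch, with names assumed from the thread:

object Pixels {
  // no reference to an enclosing class, so instances serialize cleanly
  class PixelGenerator(startTime: Long, endTime: Long) extends Serializable {
    def generate(line: String): String = s"[$startTime, $endTime] $line"  // placeholder logic
  }
}

rdd.mapPartitions { iter =>
  val gen = new Pixels.PixelGenerator(startTime, endTime)
  iter.map(gen.generate)
}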
> On Nov 16, 2015, at 1:22 PM, Zhang, Jingyu wrote:
>
> Fengdong
> On 16 November 2015 at 16:05, Fengdong Yu <fengdo...@everstring.com
> <mailto:fengdo...@everstring.com>> wrote:
> Can you try : new PixelGenerator(startTime, endTime) ?
>
>
>
>> On Nov 16, 2015, at 12:47 PM, Zhang, Jingyu <jingyu.zh...@news.com.au
>> <mailto:j
The code looks good. Can you check the imports in your code? Because it calls
'honeywell.test'.
> On Nov 16, 2015, at 3:02 PM, Yogesh Vyas wrote:
>
> Hi,
>
> While I am trying to read a json file using SQLContext, i get the
> following error:
>
> Exception in
Ignore my inputs; I think HiveSpark.java is where your main method is located.
Can you paste the whole pom.xml and your code?
> On Nov 16, 2015, at 3:39 PM, Fengdong Yu <fengdo...@everstring.com> wrote:
>
> The code looks good. can you check your ‘import’ in your code? beca
And also make sure your Scala version is 2.11 in your build.
> On Nov 16, 2015, at 3:43 PM, Fengdong Yu <fengdo...@everstring.com> wrote:
>
> Ignore my inputs, I think HiveSpark.java is your main method located.
>
> can you paste the whole pom.xml and your code?
>
What's your SQL?
> On Nov 16, 2015, at 3:02 PM, Yogesh Vyas wrote:
>
> Hi,
>
> While I am trying to read a json file using SQLContext, i get the
> following error:
>
> Exception in thread "main" java.lang.NoSuchMethodError:
>
This is the simplest announcement I've seen.
> On Nov 11, 2015, at 12:49 AM, Reynold Xin wrote:
>
> Hi All,
>
> Spark 1.5.2 is a maintenance release containing stability fixes. This release
> is based on the branch-1.5 maintenance branch of Spark. We *strongly
>
Can you try 1.6.0rc7 manually?
> On Nov 9, 2015, at 9:34 PM, swetha kasireddy <swethakasire...@gmail.com>
> wrote:
>
> I am using the following:
>
>
> <dependency>
>   <groupId>com.twitter</groupId>
>   <artifactId>parquet-avro</artifactId>
>   <version>1.6.0</version>
> </dependency>
>
> On Mon, Nov 9, 2015 at 1:00 AM, Fengdong Yu <fengdo
Which Spark version are you using?
It was fixed in Parquet 1.7.x, so Spark 1.5.x will work.
> On Nov 9, 2015, at 3:43 PM, swetha wrote:
>
> Hi,
>
> I see unwanted Warning when I try to save a Parquet file in hdfs in Spark.
> Please find below the code and the Warning
Was this released with Spark 1.x, or is it still only in trunk?
> On Oct 27, 2015, at 6:22 PM, Adrian Tanase wrote:
>
> Also I just remembered about cloudera’s contribution
> http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/
>
Also, please move the HBase-related code into a Scala object; this will resolve
the serialization issue and avoid opening the connection repeatedly.
And remember to close the table after the final flush.
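A minimal sketch of that pattern, assuming the HBase 1.x client API (table and column details are placeholders):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

object HBaseConn {
  // lazy val: the connection is opened once per executor JVM, not per task
  lazy val connection: Connection =
    ConnectionFactory.createConnection(HBaseConfiguration.create())
}

rdd.foreachPartition { rows =>
  val table = HBaseConn.connection.getTable(TableName.valueOf("mytable"))
  rows.foreach { row => /* table.put(...) */ }
  table.close()  // close the table (not the shared connection) after the final flush
}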
> On Oct 28, 2015, at 10:13 AM, Ted Yu wrote:
>
> For #2, have you checked task
How many partitions did you generate?
If millions were generated, a huge amount of memory is consumed.
> On Oct 26, 2015, at 10:58 AM, Jerry Lam wrote:
>
> Hi guys,
>
> I mentioned that the partitions are generated so I tried to read the
> partition data from it. The driver
I don't recommend this code style; you'd better brace the function block:

val testLabels = testRDD.map { case (file, text) =>
  val topic = file.split("/").takeRight(2).head
  newsgroupsMap(topic)
}
> On Oct 14, 2015, at 15:46, Nick Pentreath wrote:
>
> Hi there.
Oh, yes. Thanks very much.
> On Oct 14, 2015, at 18:47, Akhil Das wrote:
>
> com.holdenkarau.spark.testing
Can you search the mail archive before asking the question? At least search for
how to ask a question.
Nobody can give you an answer if you don't paste your SQL or Spark SQL code.
> On Oct 14, 2015, at 17:40, Andy Zhao wrote:
>
> Hi guys,
>
> I'm testing sparkSql 1.5.1,
Hi,
How do I add the dependency in build.sbt if I want to use SharedSparkContext?
I've added spark-core, but it doesn't work (cannot find SharedSparkContext).
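A sketch of one way, assuming the com.holdenkarau spark-testing-base package (which provides a SharedSparkContext; the version string is my guess for Spark 1.5.x):

libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "1.5.2_0.3.1" % "test"

Alternatively, Spark's own SharedSparkContext ships in spark-core's test jar:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "test" classifier "tests"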