Spark is not adding any statistics (STAT) metadata to Parquet files in version 1.6.x.
As a result, all files are scanned when a filter is applied.
(1 to 30).map(i => (i, i.toString)).toDF("a", "b")
  .sort("a").coalesce(1)
  .write.format("parquet").saveAsTable("metrics")
./parquet-meta /user/hive/warehouse/metrics/*.parquet
file: file:/user/h
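To check from the Spark side whether the filter at least reaches the Parquet scan, a quick sketch against the table created above (in 1.6 the physical plan lists PushedFilters; note this is separate from whether row groups are actually skipped via min/max stats):

val metrics = sqlContext.table("metrics")
// Look for "PushedFilters: [LessThan(a,10)]" in the printed physical plan.
metrics.filter("a < 10").explain()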
Hi All,
I am using CDH 5.7, which comes with Spark 1.6.0. I am saving my data set as Parquet and then querying it. The query executes fine, but when I checked the files generated by Spark, I found that the statistics (min/max) are missing for all the columns, and hence filters are not pushed down.
>
>
> Regards,
> Gourav
>
> On Thu, Aug 3, 2017 at 11:09 AM, Rabin Banerjee <
> dev.rabin.baner...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am trying to create a DataSet from DataFrame, where dataframe has been
>> created successfu
Hi All,
I am trying to create a Dataset from a DataFrame; the DataFrame has been created successfully, and using the same bean I am trying to create the Dataset. When I run it, the DataFrame is created as expected and I am able to print its content, but not the Dataset. The Dataset is hav
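For reference, a minimal sketch of the DataFrame-to-Dataset step in 1.6, using a case class as a stand-in for the bean (the names here are illustrative, not the actual bean from the mail):

import sqlContext.implicits._

case class Rec(a: Int, b: String)                   // stand-in for the bean
val df = Seq((1, "one"), (2, "two")).toDF("a", "b")
val ds = df.as[Rec]                                 // column names must match the field names
ds.collect().foreach(println)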
running on the default pool.
Setting the receiver using: ssc.receiverStream(MyCustomReceiver())
Any help?
Regards,
Rabin Banerjee
Thanks, it worked !!
On Mon, Dec 19, 2016 at 5:55 PM, Dhaval Modi wrote:
>
> Replace <tablename> with "<namespace>:<tablename>"
>
> Regards,
> Dhaval Modi
>
> On 19 December 2016 at 13:10, Rabin Banerjee wrote:
>
>> HI All,
>>
>> I am trying to save data from Spark i
HI All,
I am trying to save data from Spark into HBase using the saveAsHadoopDataset API. Please refer to the code below. The code is working fine, but the table is getting stored in the default namespace. How do I set the namespace in the code below?
wordCounts.foreachRDD ( rdd => {
val conf = HBaseCon
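Following the reply above, the fix boils down to qualifying the table name with the namespace in the job configuration. A minimal sketch of the foreachRDD body, assuming wordCounts is a pair DStream of (word, count) and using the old-API TableOutputFormat that saveAsHadoopDataset expects (namespace, table and column names are illustrative):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf

wordCounts.foreachRDD { rdd =>
  val conf = HBaseConfiguration.create()
  val jobConf = new JobConf(conf)
  // "myNamespace:wordcounts" instead of just "wordcounts" keeps the table out of the default namespace.
  jobConf.set(TableOutputFormat.OUTPUT_TABLE, "myNamespace:wordcounts")
  jobConf.setOutputFormat(classOf[TableOutputFormat])

  rdd.map { case (word, count) =>
    val put = new Put(Bytes.toBytes(word))
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(count.toString))
    (new ImmutableBytesWritable, put)
  }.saveAsHadoopDataset(jobConf)
}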
u don't change the StorageLevel.
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L166
>
>
>
> Yong
>
> --
> *From:* Rabin Banerjee
> *Sent:* Friday, November 18, 2016 10:36 AM
HI ,
In order to do that, you can write code to read/list an HDFS directory first, then list its sub-directories. In this way, using custom logic, first identify the latest year/month/version, then read the Avro files in that directory into a DataFrame, then add year/month/version to that DataFrame using withColumn, as sketched below.
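A minimal sketch of that approach, assuming the databricks spark-avro package and an illustrative /data/avro/<year>/<month>/<version> layout (both are assumptions, not from the original mail):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.lit

val fs = FileSystem.get(sc.hadoopConfiguration)

// Pick the lexicographically latest sub-directory at each level.
def latest(dir: String): String =
  fs.listStatus(new Path(dir)).filter(_.isDirectory).map(_.getPath.getName).max

val year    = latest("/data/avro")
val month   = latest(s"/data/avro/$year")
val version = latest(s"/data/avro/$year/$month")

val df = sqlContext.read.format("com.databricks.spark.avro")
  .load(s"/data/avro/$year/$month/$version")
  .withColumn("year", lit(year))
  .withColumn("month", lit(month))
  .withColumn("version", lit(version))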
Regards,
++Stuart
val colList = df.columns
can be used
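i.e., something like this (a sketch; sort takes Columns, so map the names through col):

import org.apache.spark.sql.functions.col
df.sort(colList.map(col): _*).show()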
On Fri, Nov 18, 2016 at 8:03 PM, Stuart White wrote:
> Is this what you're looking for?
>
> val df = Seq(
> (1, "A"),
> (1, "B"),
> (1, "C"),
> (2, "D"),
> (3, "E")
> ).toDF("foo", "bar")
>
> val colList = Seq("foo", "bar")
> df.sort(colL
Shall I maintain a global HashMap to handle that? Something like Map[String, DataFrame]?
Regards,
Rabin Banerjee
is a good option.
Can anyone share the best way to implement this in Spark?
Regards,
Rabin Banerjee
On Thu, Nov 3, 2016 at 6:59 PM, Koert Kuipers wrote:
> Just realized you only want to keep first element. You can do this without
> sorting by doing something similar to min or max op
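In RDD terms, that min/max-style reduce looks roughly like this (a sketch with illustrative field names; no global sort involved):

case class Event(id: String, ts: Long, payload: String)

val events = sc.parallelize(Seq(
  Event("a", 2L, "later"), Event("a", 1L, "first"), Event("b", 5L, "only")))

// Keep, per key, the record with the smallest ts -- a min-like reduce, no sorting needed.
val firstPerKey = events
  .map(e => (e.id, e))
  .reduceByKey((x, y) => if (x.ts <= y.ts) x else y)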
claims it will work except in Spark 1.5.1 and 1.5.2.
I need a bit of elaboration on how Spark handles it internally. Also, is it more efficient than using a Window function?
Thanks in advance,
Rabin Banerjee
Trust me, the only thing that can help you in your situation is the SQOOP Oracle direct connector, which is known as ORAOOP. Spark cannot do everything; you need an Oozie workflow that triggers a Sqoop job with the Oracle direct connector to pull the data, then a Spark batch to process it.
Hope it helps !!
On
HI All,
I am writing and executing a Spark batch program which only uses Spark SQL, but it is taking a lot of time and finally failing with a GC overhead error.
Here is the program:
1. Read 3 files, one medium-sized and 2 small files, and register them as DataFrames.
2. Fire SQL with complex aggregation and windo
++ crontab :)
On Wed, Jul 20, 2016 at 9:07 AM, Andrew Ehrlich wrote:
> Another option is Oozie with the spark action:
> https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html
>
> On Jul 18, 2016, at 12:15 AM, Jagat Singh wrote:
>
> You can use following options
>
> * spark-submit from
++Deepak,
There is also an option to use saveAsHadoopFile & saveAsNewAPIHadoopFile, in which you can customize (the filename and many other things...) the way you want to save it, as sketched below. :)
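For the filename part specifically, here is a minimal sketch with the old API and MultipleTextOutputFormat (class, key and path names are illustrative; assumes a pair RDD):

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Route each record to a file named after its key, and drop the key from the output.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
}

val pairs = sc.parallelize(Seq(("2016-07-20", "event1"), ("2016-07-21", "event2")))
pairs.saveAsHadoopFile("/tmp/by-key", classOf[String], classOf[String], classOf[KeyBasedOutput])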
Happy Sparking
Regards,
Rabin Banerjee
On Wed, Jul 20, 2016 at 10:01 AM, Deepak Sharma wrote:
> In spark st
Hi Vaibhav,
Please check your YARN configuration and make sure you have available resources. Please try creating multiple queues, and submit the job to a queue with:
--queue thequeue
Regards,
Rabin Banerjee
On Wed, Jul 20, 2016 at 12:05 PM, vaibhavrtk wrote:
> I have a silly question:
>
> Do
basis.
Regards,
Rabin Banerjee
On Wed, Jul 20, 2016 at 12:06 PM, Yu Wei wrote:
> I need to write all data received from MQTT data into hbase for further
> processing.
>
> They're not final result. I also need to read the data from hbase for
> analysis.
>
>
> Is it good
Hi Divya,
Try,
val df = sqlContext.sql("select from_unixtime(ts,'yyyy-MM-dd') as `ts` from mr")
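If you specifically want a UDF rather than the built-in, a minimal sketch (the UDF name and table are illustrative; assumes ts holds unix seconds):

import java.text.SimpleDateFormat
import java.util.Date

// Register a UDF that formats a unix timestamp (seconds) as a date-time string.
sqlContext.udf.register("tsToDate", (ts: Long) =>
  new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date(ts * 1000L)))

val df2 = sqlContext.sql("select tsToDate(ts) as dt from mr")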
Regards,
Rabin
On Wed, Jul 20, 2016 at 12:44 PM, Divya Gehlot wrote:
> Hi,
> Could somebody share an example of writing and calling a udf which converts
> a unix time stamp to date time?
>
>
> Thanks,
>
to an MS-Access file to access via JDBC,
Regards,
Rabin Banerjee
On Wed, Jul 20, 2016 at 2:12 PM, Yogesh Vyas wrote:
> Hi,
>
> I am trying to load and read excel sheets from HDFS in sparkR using
> XLConnect package.
> Can anyone help me in finding out how to read xls files from H
HI ,
Please check your deploy mode and master. For example, if you want to deploy in yarn cluster mode you should use --master yarn-cluster; if you want to do it in yarn client mode you should use --master yarn-client.
Please note that for your case deploying in yarn-cluster will be better, as cluster mode
the latest 4 ts data on get(key). Spark Streaming will get the ID from Kafka, then read the data from HBase using get(ID). This will eliminate the use of windowing in Spark Streaming. Is it good to use?
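A minimal sketch of that HBase read path with the 1.x client API (table, column family and ID are illustrative; the column family would need VERSIONS => 4 so the latest 4 timestamps are retained):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Get, HTable}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
val table = new HTable(conf, "device_readings")

val deviceId = "sensor-001"          // the ID received from Kafka
val get = new Get(Bytes.toBytes(deviceId))
get.setMaxVersions(4)                // return the latest 4 cell versions (timestamps)
val result = table.get(get)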
Regards,
Rabin Banerjee
On Tue, Jul 19, 2016 at 8:44 PM, Cody Koeninger wrote:
> Unless
" I am working on a spark application that requires the ability to run a
function on each node in the cluster
"
--
Use Apache Ignite instead of Spark. Trust me it's awesome for this use case.
Regards,
Rabin Banerjee
On Jul 19, 2016 3:27 AM, "joshuata" wrote:
Just to add,
I want to read the MAX_OFFSET of a topic, then read from MAX_OFFSET-200, every time.
Also I want to know: if I want to fetch a specific offset range for batch processing, is there any option for doing that?
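For the specific-offset-range case, a minimal sketch with the direct API's createRDD (broker, topic and partition are illustrative, and looking up the latest offset is left abstract here):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// Latest offset of partition 0, looked up beforehand (e.g. via Kafka's offset request API).
val latestOffset: Long = ???

// Read exactly the last 200 messages of partition 0 of "mytopic" as a batch RDD.
val ranges = Array(OffsetRange("mytopic", 0, latestOffset - 200, latestOffset))
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](sc, kafkaParams, ranges)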
On Sat, Jul 16, 2016 at 9:08 PM, Rabin Banerjee <
dev.rabin.ba
HI All,
I have 1000 Kafka topics, each storing messages for different devices. I want to use the direct approach for connecting to Kafka from Spark, in which I am only interested in the latest 200 messages in Kafka.
How do I do that?
Thanks.
Ya, I mean dump it into HDFS as a file, via yarn-cluster mode.
On Jul 8, 2016 3:10 PM, "Yu Wei" wrote:
> How could I dump data into text file? Writing to HDFS or other approach?
>
>
> Thanks,
>
> Jared
> ------
> *From:* Rabin Banerjee
>
// scalastyle:off println
> println("---")
> println("Time: " + time)
> println("---")
> firstNum.take(num).foreach(println)
> if (firstNum.length
It's not required:
*Simplified Parallelism:* No need to create multiple input Kafka streams
and union them. With directStream, Spark Streaming will create as many RDD
partitions as there are Kafka partitions to consume, which will all read
data from Kafka in parallel. So there is a one-to-one map
In yarn-cluster mode, the driver runs inside the AM, so you can find the logs in that AM's log. Open the ResourceManager UI and check for the job and its logs, or use: yarn logs -applicationId
In yarn-client mode, the driver is the same JVM from which you are launching, so you are getting it in that log.
On
HI Buddy,
I used both, but DataFrame.write.jdbc is old and will work if you provide a table name; it won't work if you provide custom queries. Whereas DataFrame.write.format is more generic and works perfectly with not only a table name but also custom queries. Hence I recommend using the
Checked in spark-shell with spark 1.5.0
scala> val emmpdat = sc.textFile("empfile");
emmpdat: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[23] at
textFile at :21
scala> case class EMP (id:Int , name : String , deptId: Int)
defined class EMP
scala> val empdf = emmpdat.map((f) => {val ff=f.
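The last line is cut off above; a minimal completion of the same pattern, assuming a comma-delimited file (the split/trim logic is an assumption; toDF comes from sqlContext.implicits._, which the shell imports automatically):

scala> val empdf = emmpdat.map { f => val ff = f.split(","); EMP(ff(0).trim.toInt, ff(1).trim, ff(2).trim.toInt) }.toDF()

scala> empdf.show()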
'd have to collect the RDD and then
> println, really. Also see DStream.print()
>
> On Wed, Jul 6, 2016 at 1:07 PM, Rabin Banerjee wrote:
> > It's not working because , you haven't collected the data.
> >
> > Try something like
> >
> > DS
It's not working because you haven't collected the data.
Try something like
dstream.foreachRDD(rdd => rdd.foreach(println))
Thanks,
Rabin
On Wed, Jul 6, 2016 at 5:05 PM, Yu Wei wrote:
> Hi guys,
>
>
> It seemed that when launching application via yarn on single node,
> JavaDStream.print
Hi Team,
I am new to Spark Streaming. I am trying to write a Spark Streaming application where the calculation on the incoming data will be performed in "R" in the micro-batching.
But I want to make wordCounts.mapToPair parallel, where wordCounts is the output of groupByKey. How can I ensure th