date:20151118

Re: orc read issue n spark

2015-11-18 Thread Reynold Xin

What do you mean by starts delay scheduling? Are you saying it is no longer
doing local reads?

If that's the case you can increase the spark.locality.read timeout.

On Wednesday, November 18, 2015, Renu Yadav  wrote:

> Hi ,
> I am using spark 1.4.1 and saving orc file using
> df.write.format("orc").save("outputlocation")
>
> outputloation size 440GB
>
> and while reading df.read.format("orc").load("outputlocation").count
>
>
> it has 2618 partitions .
> the count operation runs fine uptil 2500 but starts delay scheduling after
> that which results in slow performance.
>
> *If anyone has any idea on this.Please do reply as I need this  very
> urgent*
>
> Thanks in advance
>
>
> Regards,
> Renu Yadav
>
>
>

Re: ISDATE Function

2015-11-18 Thread Ruslan Dautkhanov

You could write your own UDF isdate().



-- 
Ruslan Dautkhanov

On Tue, Nov 17, 2015 at 11:25 PM, Ravisankar Mani  wrote:

> Hi Ted Yu,
>
> Thanks for your response. Is any other way to achieve in Spark Query?
>
>
> Regards,
> Ravi
>
> On Tue, Nov 17, 2015 at 10:26 AM, Ted Yu  wrote:
>
>> ISDATE() is currently not supported.
>> Since it is SQL Server specific, I guess it wouldn't be added to Spark.
>>
>> On Mon, Nov 16, 2015 at 10:46 PM, Ravisankar Mani 
>> wrote:
>>
>>> Hi Everyone,
>>>
>>>
>>>  In MSSQL server suppprt "ISDATE()" function is used to fine current
>>> column values date or not?.  Is any possible to achieve current column
>>> values date or not?
>>>
>>>
>>> Regards,
>>> Ravi
>>>
>>
>>
>

Re: how can evenly distribute my records in all partition

2015-11-18 Thread prateek arora

Hi
Thanks for the help.
In my Case ...
I want to perform operation on 30 record per second using spark streaming.
and difference between key of records is around 33-34 ms and my RDD that
have 30 records already have 4 partition.
and right now my algo take around 400 ms to perform operation on 1 record .
so i want to distribute my records evenly so every executor perform
operation only on one record and my 1 second batch will be completed
without delay.


On Tue, Nov 17, 2015 at 7:50 PM, Sonal Goyal  wrote:

> Think about how you want to distribute your data and how your keys are
> spread currently. Do you want to compute something per day, per week etc.
> Based on that, return a partition number. You could use mod 30 or some such
> function to get the partitions.
> On Nov 18, 2015 5:17 AM, "prateek arora" 
> wrote:
>
>> Hi
>> I am trying to implement custom partitioner using this link
>> http://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where
>> ( in link example key value is from 0 to (noOfElement - 1))
>>
>> but not able to understand how i  implement  custom partitioner  in my
>> case:
>>
>> my parent RDD have 4 partition and RDD key is : TimeStamp and Value is
>> JPEG Byte Array
>>
>>
>> Regards
>> Prateek
>>
>>
>> On Tue, Nov 17, 2015 at 9:28 AM, Ted Yu  wrote:
>>
>>> Please take a look at the following for example:
>>>
>>> ./core/src/main/scala/org/apache/spark/api/python/PythonPartitioner.scala
>>> ./core/src/main/scala/org/apache/spark/Partitioner.scala
>>>
>>> Cheers
>>>
>>> On Tue, Nov 17, 2015 at 9:24 AM, prateek arora <
>>> prateek.arora...@gmail.com> wrote:
>>>
 Hi
 Thanks
 I am new in spark development so can you provide some help to write a
 custom partitioner to achieve this.
 if you have and link or example to write custom partitioner please
 provide to me.

 On Mon, Nov 16, 2015 at 6:13 PM, Sabarish Sasidharan <
 sabarish.sasidha...@manthan.com> wrote:

> You can write your own custom partitioner to achieve this
>
> Regards
> Sab
> On 17-Nov-2015 1:11 am, "prateek arora" 
> wrote:
>
>> Hi
>>
>> I have a RDD with 30 record ( Key/value pair ) and running 30
>> executor . i
>> want to reparation this RDD in to 30 partition so every partition
>> get one
>> record and assigned to one executor .
>>
>> when i used rdd.repartition(30) its repartition my rdd in 30
>> partition but
>> some partition get 2 record , some get 1 record and some not getting
>> any
>> record .
>>
>> is there any way in spark so i can evenly distribute my record in all
>> partition .
>>
>> Regards
>> Prateek
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/how-can-evenly-distribute-my-records-in-all-partition-tp25394.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>

>>>
>>

Re: zeppelin (or spark-shell) with HBase fails on executor level

2015-11-18 Thread Ted Yu

Interesting.

I will watching your PR.

On Wed, Nov 18, 2015 at 7:51 AM, 임정택  wrote:

> Ted,
>
> I suspect I hit the issue
> https://issues.apache.org/jira/browse/SPARK-11818
> Could you refer the issue and verify that it makes sense?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 2015-11-18 20:32 GMT+09:00 Ted Yu :
>
>> Here is related code:
>>
>>   private static void checkDefaultsVersion(Configuration conf) {
>>
>> if (conf.getBoolean("hbase.defaults.for.version.skip", Boolean.FALSE))
>> return;
>>
>> String defaultsVersion = conf.get("hbase.defaults.for.version");
>>
>> String thisVersion = VersionInfo.getVersion();
>>
>> if (!thisVersion.equals(defaultsVersion)) {
>>
>>   throw new RuntimeException(
>>
>> "hbase-default.xml file seems to be for an older version of
>> HBase (" +
>>
>> defaultsVersion + "), this version is " + thisVersion);
>>
>> null means that "hbase.defaults.for.version" was not set in the other
>> hbase-default.xml
>>
>> Can you retrieve the classpath of Spark task so that we can have more
>> clue ?
>>
>>
>> Cheers
>>
>> On Tue, Nov 17, 2015 at 10:06 PM, 임정택  wrote:
>>
>>> Ted,
>>>
>>> Thanks for the reply.
>>>
>>> My fat jar has dependency with spark related library to only spark-core
>>> as "provided".
>>> Seems like Spark only adds 0.98.7-hadoop2 of hbase-common in
>>> spark-example module.
>>>
>>> And if there're two hbase-default.xml in the classpath, should one of
>>> them be loaded, instead of showing (null)?
>>>
>>> Best,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>>
>>>
>>> 2015-11-18 13:50 GMT+09:00 Ted Yu :
>>>
 Looks like there're two hbase-default.xml in the classpath: one for 0.98.6
 and another for 0.98.7-hadoop2 (used by Spark)

 You can specify hbase.defaults.for.version.skip as true in your
 hbase-site.xml

 Cheers

 On Tue, Nov 17, 2015 at 1:01 AM, 임정택  wrote:

> Hi all,
>
> I'm evaluating zeppelin to run driver which interacts with HBase.
> I use fat jar to include HBase dependencies, and see failures on
> executor level.
> I thought it is zeppelin's issue, but it fails on spark-shell, too.
>
> I loaded fat jar via --jars option,
>
> > ./bin/spark-shell --jars hbase-included-assembled.jar
>
> and run driver code using provided SparkContext instance, and see
> failures from spark-shell console and executor logs.
>
> below is stack traces,
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> 55 in stage 0.0 failed 4 times, most recent failure: Lost task 55.3 in 
> stage 0.0 (TID 281, ): java.lang.NoClassDefFoundError: 
> Could not initialize class 
> org.apache.hadoop.hbase.client.HConnectionManager
> at org.apache.hadoop.hbase.client.HTable.(HTable.java:197)
> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
> at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:128)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
>

Re: Calculating Timeseries Aggregation

2015-11-18 Thread Sandip Mehta

TD thank you for your reply.

I agree on data store requirement. I am using HBase as an underlying store.

So for every batch interval of say 10 seconds

- Calculate the time dimension ( minutes, hours, day, week, month and quarter ) 
along with other dimensions and metrics
- Update relevant base table at each batch interval for relevant metrics for a 
given set of dimensions.

Only caveat I see is I’ll have to update each of the different roll up table 
for each batch window.

Is this a valid approach for calculating time series aggregation?

Regards
SM

For minutes level aggregates I have set up a streaming window say 10 seconds 
and storing minutes level aggregates across multiple dimension in HBase at 
every window interval. 

> On 18-Nov-2015, at 7:45 AM, Tathagata Das  wrote:
> 
> For this sort of long term aggregations you should use a dedicated data 
> storage systems. Like a database, or a key-value store. Spark Streaming would 
> just aggregate and push the necessary data to the data store. 
> 
> TD
> 
> On Sat, Nov 14, 2015 at 9:32 PM, Sandip Mehta  > wrote:
> Hi,
> 
> I am working on requirement of calculating real time metrics and building 
> prototype  on Spark streaming. I need to build aggregate at Seconds, Minutes, 
> Hours and Day level.
> 
> I am not sure whether I should calculate all these aggregates as  different 
> Windowed function on input DStream or shall I use updateStateByKey function 
> for the same. If I have to use updateStateByKey for these time series 
> aggregation, how can I remove keys from the state after different time lapsed?
> 
> Please suggest.
> 
> Regards
> SM
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
>

Re: zeppelin (or spark-shell) with HBase fails on executor level

2015-11-18 Thread 임정택

Ted,

I suspect I hit the issue https://issues.apache.org/jira/browse/SPARK-11818
Could you refer the issue and verify that it makes sense?

Thanks,
Jungtaek Lim (HeartSaVioR)

2015-11-18 20:32 GMT+09:00 Ted Yu :

> Here is related code:
>
>   private static void checkDefaultsVersion(Configuration conf) {
>
> if (conf.getBoolean("hbase.defaults.for.version.skip", Boolean.FALSE))
> return;
>
> String defaultsVersion = conf.get("hbase.defaults.for.version");
>
> String thisVersion = VersionInfo.getVersion();
>
> if (!thisVersion.equals(defaultsVersion)) {
>
>   throw new RuntimeException(
>
> "hbase-default.xml file seems to be for an older version of HBase
> (" +
>
> defaultsVersion + "), this version is " + thisVersion);
>
> null means that "hbase.defaults.for.version" was not set in the other
> hbase-default.xml
>
> Can you retrieve the classpath of Spark task so that we can have more clue
> ?
>
>
> Cheers
>
> On Tue, Nov 17, 2015 at 10:06 PM, 임정택  wrote:
>
>> Ted,
>>
>> Thanks for the reply.
>>
>> My fat jar has dependency with spark related library to only spark-core
>> as "provided".
>> Seems like Spark only adds 0.98.7-hadoop2 of hbase-common in
>> spark-example module.
>>
>> And if there're two hbase-default.xml in the classpath, should one of
>> them be loaded, instead of showing (null)?
>>
>> Best,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>>
>> 2015-11-18 13:50 GMT+09:00 Ted Yu :
>>
>>> Looks like there're two hbase-default.xml in the classpath: one for 0.98.6
>>> and another for 0.98.7-hadoop2 (used by Spark)
>>>
>>> You can specify hbase.defaults.for.version.skip as true in your
>>> hbase-site.xml
>>>
>>> Cheers
>>>
>>> On Tue, Nov 17, 2015 at 1:01 AM, 임정택  wrote:
>>>
 Hi all,

 I'm evaluating zeppelin to run driver which interacts with HBase.
 I use fat jar to include HBase dependencies, and see failures on
 executor level.
 I thought it is zeppelin's issue, but it fails on spark-shell, too.

 I loaded fat jar via --jars option,

 > ./bin/spark-shell --jars hbase-included-assembled.jar

 and run driver code using provided SparkContext instance, and see
 failures from spark-shell console and executor logs.

 below is stack traces,

 org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 
 in stage 0.0 failed 4 times, most recent failure: Lost task 55.3 in stage 
 0.0 (TID 281, ): java.lang.NoClassDefFoundError: Could not 
 initialize class org.apache.hadoop.hbase.client.HConnectionManager
 at org.apache.hadoop.hbase.client.HTable.(HTable.java:197)
 at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
 at 
 org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
 at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:128)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 Driver stacktrace:
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)

Unable to load native-hadoop library for your platform - already loaded in another classloader

2015-11-18 Thread Deenar Toraskar

I want to make sure we use short-circuit local reads for performance. I
have set the LD_LIBRARY_PATH correctly, confirmed that the native libraries
match our platform (i.e. are 64 bit and are loaded successfully). When I
start spark, i get the following message after increasing the logging level
for the relevant classes.

*15/11/18 17:47:23 DEBUG NativeCodeLoader: Failed to load native-hadoop
with error: java.lang.UnsatisfiedLinkError: Native Library
/usr/hdp/2.3.2.0-2950/hadoop/lib/native/libhadoop.so.1.0.0 already loaded
in another classloader*

Any idea what might be causing it and how to resolve this.

Regards
Deenar

[spark@edgenode1 spark-1.5.2-bin-hadoop2.6]$ bin/spark-shell
15/11/18 17:46:56 DEBUG NativeCodeLoader: Trying to load the custom-built
native-hadoop library...
15/11/18 17:46:56 DEBUG NativeCodeLoader: Loaded the native-hadoop library
Welcome to
__
/ __/__ ___ _/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.5.2
/_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
15/11/18 17:47:00 WARN Utils: Service 'SparkUI' could not bind on port
4040. Attempting port 4041.
15/11/18 17:47:00 WARN Utils: Service 'SparkUI' could not bind on port
4041. Attempting port 4042.
15/11/18 17:47:00 WARN MetricsSystem: Using default name DAGScheduler for
source because spark.app.id is not set.
spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
Spark context available as sc.
15/11/18 17:47:23 DEBUG NativeCodeLoader: Trying to load the custom-built
native-hadoop library...
*15/11/18 17:47:23 DEBUG NativeCodeLoader: Failed to load native-hadoop
with error: java.lang.UnsatisfiedLinkError: Native Library
/usr/hdp/2.3.2.0-2950/hadoop/lib/native/libhadoop.so.1.0.0 already loaded
in another classloader*
15/11/18 17:47:23 DEBUG NativeCodeLoader:
java.library.path=/usr/hdp/2.3.2.0-2950/hadoop/lib/native/:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
15/11/18 17:47:23 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
15/11/18 17:47:24 WARN DomainSocketFactory: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.
SQL context available as sqlContext.

72 matches

Mail list logo