Re: Spark jdbc postgres numeric array

2019-01-05 Thread Alexey
Hi, 

I also filed a jira yesterday:
https://issues.apache.org/jira/browse/SPARK-26538

Looks like one needs to be closed as duplicate. Sorry for the late update.

Best regards






Spark jdbc postgres numeric array

2018-12-31 Thread Alexey
Hi,

I came across strange behavior when dealing with postgres columns of type 
numeric[] using Spark 2.3.2, PostgreSQL 10.4, 9.6.9.
Consider the following table definition:

create table test1
(
   v  numeric[],
   d  numeric
);

insert into test1 values('{.222,.332}', 222.4555);

When reading the table into a Dataframe, I get the following schema:

root
 |-- v: array (nullable = true)
 |    |-- element: decimal(0,0) (containsNull = true)
 |-- d: decimal(38,18) (nullable = true)

Notice that for both columns precision and scale were not specified, but in the 
case of the array element both were set to 0, while in the other case the defaults 
were applied.

Later, when I try to read the Dataframe, I get the following error:

java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
exceeds max precision 0
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:453)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$16$$anonfun$apply$6$$anonfun$apply$7.apply(JdbcUtils.scala:474)
...

I would expect to get array elements of type decimal(38,18) and no error when 
reading in this case.
Should this be considered a bug? Is there a workaround other than changing the 
column array type definition to include explicit precision and scale?
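For anyone hitting the same thing, the workaround I would try first is below. It is only a 
sketch and an assumption on my part (untested beyond the table above): push an explicit cast 
into a JDBC subquery so the driver reports a concrete precision and scale for the array 
elements, instead of altering the table definition itself. Connection settings are hypothetical.

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/testdb")                    // hypothetical connection
  .option("dbtable", "(select v::numeric(38,18)[] as v, d from test1) as t")   // explicit cast instead of ALTER TABLE
  .option("user", "test")
  .option("password", "test")
  .load()

df.printSchema()   // the array elements should now come back as decimal(38,18)
df.show()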

Best regards,
Alexey




mapWithState() without data checkpointing

2016-09-29 Thread Alexey Kharlamov
Hello!

I would like to avoid data checkpointing when processing a DStream. Basically, 
we do not care if the intermediate data are lost. 

Is there a way to achieve that? Is there an extension point or class embedding 
all associated activities?

Thanks!

Sincerely yours,
—
Alexey Kharlamov



Re: VectorUDT with spark.ml.linalg.Vector

2016-08-16 Thread Alexey Svyatkovskiy
Hi Yanbo,

Thanks for your reply. I will keep an eye on that pull request.
For now, I decided to just put my code inside org.apache.spark.ml to be
able to access private classes.
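
For completeness, here is a minimal sketch of that package-placement trick (my own code, not an 
official API; it assumes Spark 2.0.0, where the ml VectorUDT is package-private inside the Spark 
namespace):

package org.apache.spark.ml.custom   // our own sub-package under org.apache.spark.ml

import org.apache.spark.ml.linalg.VectorUDT
import org.apache.spark.sql.types.{DataType, StructField, StructType}

object VectorSchemas {
  // compiles only because this file is declared inside org.apache.spark.ml;
  // from a user package the same line fails with the access error quoted below
  val vectorType: DataType = new VectorUDT

  // e.g. the input schema of a UDAF over an ml Vector column
  val udafInput: StructType = StructType(Seq(StructField("features", vectorType)))
}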

Thanks,
Alexey

On Tue, Aug 16, 2016 at 11:13 PM, Yanbo Liang <yblia...@gmail.com> wrote:

> It seams that VectorUDT is private and can not be accessed out of Spark
> currently. It should be public but we need to do some refactor before make
> it public. You can refer the discussion at https://github.com/apache/
> spark/pull/12259 .
>
> Thanks
> Yanbo
>
> 2016-08-16 9:48 GMT-07:00 alexeys <alex...@princeton.edu>:
>
>> I am writing an UDAF to be applied to a data frame column of type Vector
>> (spark.ml.linalg.Vector). I rely on spark/ml/linalg so that I do not have
>> to
>> go back and forth between dataframe and RDD.
>>
>> Inside the UDAF, I have to specify a data type for the input, buffer, and
>> output (as usual). VectorUDT is what I would use with
>> spark.mllib.linalg.Vector:
>> https://github.com/apache/spark/blob/master/mllib/src/main/
>> scala/org/apache/spark/mllib/linalg/Vectors.scala
>>
>> However, when I try to import it from spark.ml instead: import
>> org.apache.spark.ml.linalg.VectorUDT
>> I get a runtime error (no errors during the build):
>>
>> class VectorUDT in package linalg cannot be accessed in package
>> org.apache.spark.ml.linalg
>>
>> Is it expected/can you suggest a workaround?
>>
>> I am using Spark 2.0.0
>>
>> Thanks,
>> Alexey
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/VectorUDT-with-spark-ml-linalg-Vector-tp27542.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Ideas to put a Spark ML model in production

2016-07-03 Thread Alexey Pechorin
From my personal experience - we're reading the metadata of the features
column in the dataframe to extract mapping of the feature indices to the
original feature name, and use this mapping to translate the model
coefficients into a JSON string that maps the original feature names to
their weights. The production environment has a simple code that evaluates
a logistic model based on this JSON string and the real inputs.

I would be very interested to find a more straightforward approach to
exporting the model into a format that's readable by systems without Spark
installed on them.
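
A rough sketch of the idea (not our production code; it assumes Spark 2.x spark.ml, a fitted 
LogisticRegressionModel, and a "features" column built by VectorAssembler; the JSON is 
hand-rolled to keep the sketch dependency-free):

import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.sql.DataFrame

def coefficientsAsJson(model: LogisticRegressionModel, assembled: DataFrame): String = {
  // feature index -> original column name, recovered from the "features" column metadata
  val names: Array[String] =
    AttributeGroup.fromStructField(assembled.schema("features")).attributes match {
      case Some(attrs) => attrs.map(a => a.name.getOrElse(s"f${a.index.getOrElse(-1)}"))
      case None        => model.coefficients.toArray.indices.map(i => s"f$i").toArray
    }
  // pair every weight with its name and emit a flat JSON object by hand
  val pairs = names.zip(model.coefficients.toArray).map { case (n, w) => s""""$n": $w""" }
  (pairs :+ s""""intercept": ${model.intercept}""").mkString("{", ", ", "}")
}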

On Sat, Jul 2, 2016 at 10:45 AM, Yanbo Liang  wrote:

> Let's suppose you have trained a LogisticRegressionModel and saved it at
> "/tmp/lr-model". You can copy the directory to production environment and
> use it to make prediction on users new data. You can refer the following
> code snippets:
>
> val model = LogisticRegressionModel.load("/tmp/lr-model")
> val data = newDataset
> val prediction = model.transform(data)
>
> However, usually we save/load PipelineModel which include necessary
> feature transformers and model training process rather than the single
> model, but they are similar operations.
>
> Thanks
> Yanbo
>
> 2016-06-23 10:54 GMT-07:00 Saurabh Sardeshpande :
>
>> Hi all,
>>
>> How do you reliably deploy a spark model in production? Let's say I've
>> done a lot of analysis and come up with a model that performs great. I have
>> this "model file" and I'm not sure what to do with it. I want to build some
>> kind of service around it that takes some inputs, converts them into a
>> feature, runs the equivalent of 'transform', i.e. predict the output and
>> return the output.
>>
>> At the Spark Summit I heard a lot of talk about how this will be easy to
>> do in Spark 2.0, but I'm looking for a solution sooner. PMML support is
>> limited and the model I have can't be exported in that format.
>>
>> I would appreciate any ideas around this, especially from personal
>> experiences.
>>
>> Regards,
>> Saurabh
>>
>
>


Re: cache datframe

2016-06-16 Thread Alexey Pechorin
What's the reason for your first cache call? It looks like you've used the
data only once, to transform it, without reusing it afterwards, so there's no
need for the first cache call; you only need the second one (and even that
depends on the rest of your code).
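
A small illustration of what I mean (column names are made up; assumes a SparkSession and the 
2.x DataFrame API):

import org.apache.spark.sql.functions.col

val raw = spark.read.parquet("/data/events")                 // read once, never reused -> no cache needed

val enriched = raw
  .withColumn("total", col("price") * col("qty"))            // the "add new columns" step
  .cache()                                                   // cache here, because it is reused twice below

val byUser    = enriched.groupBy("user_id").count()          // first reuse
val bigOrders = enriched.filter(col("total") > 100).count()  // second reuse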

On Thu, Jun 16, 2016 at 3:17 PM, pseudo oduesp wrote:

> hi,
> if I cache the same data frame, then transform it and add columns, should I
> cache it a second time?
>
> df.cache()
>
>   transforamtion
>   add new columns
>
> df.cache()
> ?
>
>


sliding Top N window

2016-03-11 Thread Yakubovich, Alexey
Good day,

I have the following task: a stream of “page views” coming into a Kafka topic. Each 
view contains a list of product Ids from the visited page. The task: to have the 
Top N products in “real time”.

I am interested in a solution that would require minimal intermediate writes… So I 
need to build a sliding window for the top N products, where the product counters 
change dynamically and the window presents the top products for the specified 
period of time.

I believe there is no way to avoid maintaining all the product counters in 
memory/storage. But at least I would like to do all the logic and all the calculation 
on the fly, in memory, without spilling multiple RDDs from memory to disk.

So I believe I see one way of doing it:
   Take each msg from Kafka and line up the elementary actions (increase the counter 
for the product PID by 1).
  Each action will be implemented as a call to HTable.increment()  // or 
easier, with incrementColumnValue()…
  After each increment I can apply my own “offer” operation, which ensures that only 
the top N products with their counters are kept in another HBase table (also with 
atomic operations).
 But there is another stream of events: decreasing product counters when a view 
expires past the length of the sliding window….

So my question: does anybody know of / have and can share a piece of code or know-how 
for implementing a “sliding Top N window” better?
If nothing is offered, I will share what I end up doing myself.
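
In the meantime, here is a bare-bones Spark Streaming sketch of just the counting part, with 
assumptions spelled out (a DStream[String] of product ids called productIds, a 10-minute window 
sliding every 10 seconds, N = 10, checkpointing enabled because the inverse-reduce form is used; 
the HBase persistence from the plan above is left out):

import org.apache.spark.streaming.Seconds

val counts = productIds
  .map(pid => (pid, 1L))
  .reduceByKeyAndWindow(
    _ + _,        // counter goes up when a view enters the window
    _ - _,        // and down when it expires - the "decreasing" stream comes for free
    Seconds(600),
    Seconds(10))

val topN = counts.transform { rdd =>
  // keep only the current top 10 for each slide
  rdd.context.parallelize(rdd.top(10)(Ordering.by((p: (String, Long)) => p._2)), 1)
}
topN.print()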

Thank you
Alexey



[streaming] reading Kafka direct stream throws kafka.common.OffsetOutOfRangeException

2015-09-30 Thread Alexey Ponkin
Hi

I have a simple spark-streaming job (8 executors with 1 core each, on an 8-node cluster) that 
reads from a Kafka topic (3 brokers with 8 partitions) and saves to Cassandra.
The problem is that when I increase the number of incoming messages in the topic, the 
job starts to fail with kafka.common.OffsetOutOfRangeException.
The job fails starting from 100 events per second.
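
For the record, the mitigation I plan to try (my own assumption, not something confirmed here): 
throttle the direct stream so a job that falls behind cannot request offsets that Kafka has 
already deleted. This assumes Spark 1.5+, where backpressure is available; the retention side is 
governed by Kafka's log.retention.* settings.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-to-cassandra")                          // hypothetical app name
  .set("spark.streaming.kafka.maxRatePerPartition", "500")   // cap records/sec/partition
  .set("spark.streaming.backpressure.enabled", "true")       // let Spark adapt the ingestion rate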

Thanks in advance




Re: [streaming] DStream with window performance issue

2015-09-08 Thread Alexey Ponkin
Ok.
Spark 1.4.1 on yarn

Here is my application
I have 4 different Kafka topics(different object streams)

type Edge = (String,String)

val a = KafkaUtils.createDirectStream[...](sc,"A",params).filter( nonEmpty ).map( toEdge )
val b = KafkaUtils.createDirectStream[...](sc,"B",params).filter( nonEmpty ).map( toEdge )
val c = KafkaUtils.createDirectStream[...](sc,"C",params).filter( nonEmpty ).map( toEdge )

val u = a union b union c

val source = u.window(Seconds(600), Seconds(10))

val z = KafkaUtils.createDirectStream[...](sc,"Z",params).filter( nonEmpty ).map( toEdge )

val joinResult = source.rightOuterJoin( z )
joinResult.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // save to result topic in kafka
  }
}

The processing time of the 'window' operation in the code above is constantly growing,
no matter how many events appear in the corresponding Kafka topics.

But if I change one line from

val source = u.window(Seconds(600), Seconds(10))

to

val partitioner = ssc.sparkContext.broadcast(new HashPartitioner(8))

val source = u.transform(_.partitionBy(partitioner.value)).window(Seconds(600), Seconds(10))

everything works perfectly.

Perhaps the problem was in WindowedDStream.

I was forced to use PartitionerAwareUnionRDD (partitionBy with the same partitioner)
instead of UnionRDD.

Nonetheless I did not see any hints about such behaviour in the docs.
Is it a bug or absolutely normal behaviour?





08.09.2015, 17:03, "Cody Koeninger" <c...@koeninger.org>:
>  Can you provide more info (what version of spark, code example)?
>
>  On Tue, Sep 8, 2015 at 8:18 AM, Alexey Ponkin <alexey.pon...@ya.ru> wrote:
>>  Hi,
>>
>>  I have an application with 2 streams, which are joined together.
>>  Stream1 - is simple DStream(relativly small size batch chunks)
>>  Stream2 - is a windowed DStream(with duration for example 60 seconds)
>>
>>  Stream1 and Stream2 are Kafka direct stream.
>>  The problem is that according to the logs the window operation is constantly
>>  increasing (screenshot: http://en.zimagez.com/zimage/screenshot2015-09-0816-06-55.php).
>>  And also I see a gap in the processing window (screenshot:
>>  http://en.zimagez.com/zimage/screenshot2015-09-0816-10-22.php);
>>  in the logs there are no events in that period.
>>  So what happens in that gap and why is the window constantly increasing?
>>
>>  Thank you in advance
>>
>>  -
>>  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>  For additional commands, e-mail: user-h...@spark.apache.org




[streaming] DStream with window performance issue

2015-09-08 Thread Alexey Ponkin
Hi,

I have an application with 2 streams, which are joined together.
Stream1 - is simple DStream(relativly small size batch chunks)
Stream2 - is a windowed DStream(with duration for example 60 seconds)

Stream1 and Stream2 are Kafka direct stream. 
The problem is that according to the logs the window operation is constantly 
increasing (screenshot: http://en.zimagez.com/zimage/screenshot2015-09-0816-06-55.php).
And also I see a gap in the processing window (screenshot: 
http://en.zimagez.com/zimage/screenshot2015-09-0816-10-22.php);
in the logs there are no events in that period.
So what happens in that gap and why is the window constantly increasing?

Thank you in advance




[streaming] Using org.apache.spark.Logging will silently break task execution

2015-09-06 Thread Alexey Ponkin
Hi,

I have the following code

object MyJob extends org.apache.spark.Logging {
  ...
  val source: DStream[SomeType] = ...

  source.foreachRDD { rdd =>
    logInfo(s"""+++ForEachRDD+++""")
    rdd.foreachPartition { partitionOfRecords =>
      logInfo(s"""+++ForEachPartition+++""")
    }
  }
}

I was expecting to see both log messages in the job log.
But unfortunately you will never see the string '+++ForEachPartition+++' in the logs,
because the foreachPartition block never executes.
And there is also no error message or anything else in the logs.
I wonder, is this a bug or known behavior?
I know that org.apache.spark.Logging is a DeveloperApi, but why does it fail silently
with no messages?
What should be used instead of org.apache.spark.Logging in spark-streaming jobs?
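
For what it's worth, a sketch of the alternative I am considering (my own assumption, not an 
official recommendation): plain SLF4J, which Spark already ships, with the executor-side logger 
created inside the partition closure so nothing non-serializable is captured.

import org.apache.spark.streaming.dstream.DStream
import org.slf4j.LoggerFactory

object MyJobLogging {
  private val log = LoggerFactory.getLogger("MyJob")    // driver-side logger

  def wire(source: DStream[String]): Unit = {
    source.foreachRDD { rdd =>
      log.info("+++ForEachRDD+++")                      // runs on the driver
      rdd.foreachPartition { partitionOfRecords =>
        // created on the executor, so no logger instance travels with the closure
        val execLog = LoggerFactory.getLogger("MyJob")
        execLog.info("+++ForEachPartition+++")
        partitionOfRecords.foreach(_ => ())             // placeholder for the real work
      }
    }
  }
}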

P.S. running spark 1.4.1 (on yarn)

Thanks in advance




Re: Getting number of physical machines in Spark

2015-08-28 Thread Alexey Grishchenko
There's no canonical way to do this, as I understand it. For instance, when
running under YARN, you have no idea at all where your containers will be
started. Moreover, if one of the containers fails, it might be
restarted on another machine, so the machine count might change at runtime.

To check the current number of machines you can do something like this
(python):

import socket
machines = sc.parallelize(xrange(1000)).mapPartitions(lambda x:
[socket.gethostname()]).distinct().collect()


On Fri, Aug 28, 2015 at 9:09 PM, Jason ja...@jasonknight.us wrote:

 I've wanted similar functionality too: when network IO bound (for me I was
 trying to pull things from s3 to hdfs) I wish there was a `.mapMachines`
 api where I wouldn't have to try to guess at the proper partitioning of a
 'driver' RDD for `sc.parallelize(1 to N, N).map( i => pull the i'th chunk
 from S3 )`.

 On Thu, Aug 27, 2015 at 10:01 AM Young, Matthew T 
 matthew.t.yo...@intel.com wrote:

 What’s the canonical way to find out the number of physical machines in a
 cluster at runtime in Spark? I believe SparkContext.defaultParallelism will
 give me the number of cores, but I’m interested in the number of NICs.



 I’m writing a Spark streaming application to ingest from Kafka with the
 Receiver API and want to create one DStream per physical machine for read
 parallelism’s sake. How can I figure out at run time how many machines
 there are so I know how many DStreams to create?




-- 
Best regards, Alexey Grishchenko

phone: +353 (87) 262-2154
email: programme...@gmail.com
web:   http://0x0fff.com


Re: Help Explain Tasks in WebUI:4040

2015-08-28 Thread Alexey Grishchenko
It really depends on the code. I would say that the easiest way is to
restart the problematic action, find the straggler task and analyze what's
happening with it with jstack / make a heap dump and analyze it locally. For
example, it might be the case that your tasks are connecting to some
external resource and this resource is timing out under the pressure. Also
call toDebugString on the problematic RDD before calling the action that
triggers the calculations; this will give you an understanding of what your
execution tasks are really doing.
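
A quick illustration of the toDebugString part (names are made up; any RDD works):

val problematic = sc.textFile("/data/input")
  .map(_.split(","))
  .filter(_.length > 3)
  .keyBy(_(0))
  .groupByKey()

// prints the lineage with its shuffle boundaries *before* the action runs,
// so you can tell which stage the straggler task belongs to
println(problematic.toDebugString)
println(problematic.count())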

On Fri, Aug 28, 2015 at 7:47 PM, Muler mulugeta.abe...@gmail.com wrote:

 I have a 7 node cluster running in standalone mode (1 executor per node,
 100g/executor, 18 cores/executor)

 Attached is the Task status for two of my nodes. I'm not clear why some of
 my tasks are taking too long:

1. [node sk5, green] task 197 took 35 mins while task 218 took less
than 2 mins. But if you look into the size of output size/records they have
almost same size. Even more strange, the size of shuffle spill for memory
and disk is 0 for task 197 and yet it is taking a long time

 Same issue for my other node (sk3, red)

 Can you please explain what is going on?

 Thanks,


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Alexey Grishchenko, http://0x0fff.com


Re: Any quick method to sample rdd based on one filed?

2015-08-28 Thread Alexey Grishchenko
In my opinion aggregate+flatMap would work faster as it would make fewer
passes through the data. It would work like this:

import random

def agg(x,y):
x[0] += 1 if not y[1] else 0
x[1] += 1 if y[1] else 0
return x

# Source data
rdd  = sc.parallelize(xrange(10), 5)
rdd2 = rdd.map(lambda x: (x, random.choice([True, False]))).cache()

# Calculate counts for True and False
counts = rdd2.aggregate([0,0], agg, lambda x,y: (x[0]+y[0],x[1]+y[1]))
# If filtering is needed
if counts[0]*10 > counts[1]:
    # Calculate sampling ratio
    prob0 = float(counts[1])/10.0 / float(counts[0])
    # Filter falses
    rdd2 = rdd2.flatMap(lambda x: [x] if (x[1] or x[0] and random.random() < prob0) else [])

# Count True and False again for validation - falses should be 10% of trues
rdd2.aggregate([0,0], agg, lambda x,y: (x[0]+y[0],x[1]+y[1]))

On Fri, Aug 28, 2015 at 6:39 PM, Sonal Goyal sonalgoy...@gmail.com wrote:

 Filter into true rdd and false rdd. Union true rdd and sample of false rdd.
 On Aug 28, 2015 2:57 AM, Gavin Yue yue.yuany...@gmail.com wrote:

 Hey,


 I have a RDD[(String,Boolean)]. I want to keep all Boolean: True rows and
 randomly keep some Boolean:false rows.  And hope in the final result, the
 negative ones could be 10 times more than positive ones.


 What would be most efficient way to do this?

 Thanks,






-- 
Best regards, Alexey Grishchenko

phone: +353 (87) 262-2154
email: programme...@gmail.com
web:   http://0x0fff.com


Re: Calculating Min and Max Values using Spark Transformations?

2015-08-28 Thread Alexey Grishchenko
If the data is already in an RDD, the easiest way to calculate min/max for
each column would be the aggregate() function. It takes 2 functions as
arguments - the first is used to aggregate RDD values into your accumulator,
the second is used to merge two accumulators. This way both min and max for
all the columns in your RDD are calculated in a single pass over it.
Here's an example in Python:

def agg1(x,y):
if len(x) == 0: x = [y,y]
return [map(min,zip(x[0],y)),map(max,zip(x[1],y))]

def agg2(x,y):
if len(x) == 0: x = y
return [map(min,zip(x[0],y[0])),map(max,zip(x[1],y[1]))]

import random

rdd  = sc.parallelize(xrange(10), 5)
rdd2 = rdd.map(lambda x: ([random.randint(1,100) for _ in xrange(15)]))
rdd2.aggregate([], agg1, agg2)

What I personally would do in your case depends on what else you want to do
with the data. If you plan to run some more business logic on top of it and
you're more comfortable with SQL, it might be worth registering this DataFrame
as a table and generating a SQL query against it (generate a string with a series
of min-max calls). But to solve your specific problem I'd load your file
with textFile(), use a map() transformation to split the string by comma and
convert it to an array of doubles, and call aggregate() on top of it just
like I've shown in the example above.
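
A Scala sketch of that "generate a SQL string" idea, under the assumption that the csv is 
already loaded into a DataFrame df registered as the temp table "data":

val cols = df.columns
val selectList = cols.flatMap(c => Seq(s"MIN($c) AS min_$c", s"MAX($c) AS max_$c"))

// one pass over the table, all mins and maxes returned in a single row
val minMax = sqlContext.sql(s"SELECT ${selectList.mkString(", ")} FROM data")
minMax.show()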

On Fri, Aug 28, 2015 at 6:15 PM, Burak Yavuz brk...@gmail.com wrote:

 Or you can just call describe() on the dataframe? In addition to min-max,
 you'll also get the mean, and count of non-null and non-NA elements as well.

 Burak

 On Fri, Aug 28, 2015 at 10:09 AM, java8964 java8...@hotmail.com wrote:

 Or RDD.max() and RDD.min() won't work for you?

 Yong

 --
 Subject: Re: Calculating Min and Max Values using Spark Transformations?
 To: as...@wso2.com
 CC: user@spark.apache.org
 From: jfc...@us.ibm.com
 Date: Fri, 28 Aug 2015 09:28:43 -0700


 If you already loaded csv data into a dataframe, why not register it as a
 table, and use Spark SQL
 to find max/min or any other aggregates? SELECT MAX(column_name) FROM
 dftable_name ... seems natural.





*JESSE CHEN*
Big Data Performance | IBM Analytics

Office:  408 463 2296
Mobile: 408 828 9068
Email:   jfc...@us.ibm.com




 From: ashensw as...@wso2.com
 To: user@spark.apache.org
 Date: 08/28/2015 05:40 AM
 Subject: Calculating Min and Max Values using Spark Transformations?

 --



 Hi all,

 I have a dataset which consists of a large number of features (columns). It is
 in csv format. So I loaded it into a spark dataframe. Then I converted it
 into a JavaRDD<Row>. Then using a spark transformation I converted that into
 a JavaRDD<String[]>. Then again converted it into a JavaRDD<double[]>. So now
 I have a JavaRDD<double[]>. So is there any method to calculate the max and
 min values of each column in this JavaRDD<double[]>?

 Or is there any way to access the array if I store the max and min values in an
 array inside the spark transformation class?

 Thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Calculating-Min-and-Max-Values-using-Spark-Transformations-tp24491.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






-- 
Best regards, Alexey Grishchenko

phone: +353 (87) 262-2154
email: programme...@gmail.com
web:   http://0x0fff.com


Unsupported major.minor version 51.0

2015-08-11 Thread Yakubovich, Alexey
I found some discussions online, but they all come down to advice to use JDK 1.7 (or 
1.8).
Well, I use JDK 1.7 on OS X Yosemite.  Both
java -version //

java version "1.7.0_80"

Java(TM) SE Runtime Environment (build 1.7.0_80-b15)

Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)

and
echo $JAVA_HOME // 
/Library/Java/JavaVirtualMachines/jdk1.7.0_80.jdk/Contents/Home
show JDK 1.7.
But for Spark 1.4.1 (and for Spark 1.2.2, downloaded 07/10/2015) I have 
the same error when building with maven (as sudo mvn -DskipTests -X clean 
package > abra.txt):

Exception in thread "main" java.lang.UnsupportedClassVersionError: 
org/apache/maven/cli/MavenCli : Unsupported major.minor version 51.0


Please help how to build the thing.

Thanks

Alexey



Can't build Spark 1.3

2015-06-02 Thread Yakubovich, Alexey
I downloaded the latest Spark (1.3) from github. Then I tried to build it.
First for scala 2.10 (and hadoop 2.4):

build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package

That resulted in a hangup after printing a bunch of lines like

[INFO] Dependency-reduced POM written at ……
[INFO] Dependency-reduced …
Then I tried for scala 2.11

mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package

That resulted in multiple compilation errors.

What I actually want is:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-0.12.0 
-Phive-thriftserver -DskipTests clean package

Is it only me who can’t build Spark 1.3?
And is there any site to download Spark prebuilt for Hadoop 2.5 and Hive?

Thank you for any help.
Alexey



Re: Is it possible to use json4s 3.2.11 with Spark 1.3.0?

2015-03-29 Thread Alexey Zinoviev
I figured out that Logging is a DeveloperApi and it should not be used outside 
Spark code, so everything is fine now. Thanks again, Marcelo.

 On 24 Mar 2015, at 20:06, Marcelo Vanzin van...@cloudera.com wrote:
 
 From the exception it seems like your app is also repackaging Scala
 classes somehow. Can you double check that and remove the Scala
 classes from your app if they're there?
 
 On Mon, Mar 23, 2015 at 10:07 PM, Alexey Zinoviev
 alexey.zinov...@gmail.com wrote:
 Thanks Marcelo, these options solved the problem (I'm using 1.3.0), but it
 works only if I remove extends Logging from the object; with extends
 Logging it returns:
 
 Exception in thread main java.lang.LinkageError: loader constraint
 violation in interface itable initialization: when resolving method
 App1$.logInfo(Lscala/Function0;Ljava/lang/Throwable;)V the class loader
 (instance of org/apache/spark/util/ChildFirstURLClassLoader) of the current
 class, App1$, and the class loader (instance of
 sun/misc/Launcher$AppClassLoader) for interface org/apache/spark/Logging
 have different Class objects for the type scala/Function0 used in the
 signature
at App1.main(App1.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at
 org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 
 Do you have any idea what's wrong with Logging?
 
 PS: I'm running it with spark-1.3.0/bin/spark-submit --class App1 --conf
 spark.driver.userClassPathFirst=true --conf
 spark.executor.userClassPathFirst=true
 $HOME/projects/sparkapp/target/scala-2.10/sparkapp-assembly-1.0.jar
 
 Thanks,
 Alexey
 
 
 On Tue, Mar 24, 2015 at 5:03 AM, Marcelo Vanzin van...@cloudera.com wrote:
 
 You could build a fat jar for your application containing both your
 code and the json4s library, and then run Spark with these two
 options:
 
  spark.driver.userClassPathFirst=true
  spark.executor.userClassPathFirst=true
 
 Both only work in 1.3. (1.2 has spark.files.userClassPathFirst, but
 that only works for executors.)
 
 
 On Mon, Mar 23, 2015 at 2:12 PM, Alexey Zinoviev
 alexey.zinov...@gmail.com wrote:
 Spark has a dependency on json4s 3.2.10, but this version has several
 bugs
 and I need to use 3.2.11. I added json4s-native 3.2.11 dependency to
 build.sbt and everything compiled fine. But when I spark-submit my JAR
 it
 provides me with 3.2.10.
 
 
 build.sbt
 
 import sbt.Keys._
 
  name := "sparkapp"

  version := "1.0"

  scalaVersion := "2.10.4"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

  libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11"


  plugins.sbt

  logLevel := Level.Warn

  resolvers += Resolver.url("artifactory",
    url("http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)

  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")


  App1.scala

  import org.apache.spark.SparkConf
  import org.apache.spark.rdd.RDD
  import org.apache.spark.{Logging, SparkConf, SparkContext}
  import org.apache.spark.SparkContext._

  object App1 extends Logging {
    def main(args: Array[String]) = {
      val conf = new SparkConf().setAppName("App1")
      val sc = new SparkContext(conf)
      println(s"json4s version: ${org.json4s.BuildInfo.version.toString}")
    }
  }
 
 
 
 sbt 0.13.7, sbt-assembly 0.13.0, Scala 2.10.4
 
 Is it possible to force 3.2.11 version usage?
 
 Thanks,
 Alexey
 
 
 
 --
 Marcelo
 
 
 
 
 
 -- 
 Marcelo





Re: Is it possible to use json4s 3.2.11 with Spark 1.3.0?

2015-03-23 Thread Alexey Zinoviev
Thanks Ted, I'll try, hope there's no transitive dependencies on 3.2.10.
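
Another way out that avoids rebuilding Spark, in case it helps someone: shade json4s inside the 
assembly so the application's 3.2.11 cannot collide with Spark's 3.2.10. This is only a sketch 
and assumes a newer sbt-assembly (0.14.x or later, where shade rules were added), placed in 
build.sbt:

assemblyShadeRules in assembly := Seq(
  // rewrite the app's json4s packages so they no longer clash with the ones Spark ships
  ShadeRule.rename("org.json4s.**" -> "shaded.json4s.@1").inAll
)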

On Tue, Mar 24, 2015 at 4:21 AM, Ted Yu yuzhih...@gmail.com wrote:

 Looking at core/pom.xml :
 <dependency>
   <groupId>org.json4s</groupId>
   <artifactId>json4s-jackson_${scala.binary.version}</artifactId>
   <version>3.2.10</version>
 </dependency>

 The version is hard coded.

 You can rebuild Spark 1.3.0 with json4s 3.2.11

 Cheers

 On Mon, Mar 23, 2015 at 2:12 PM, Alexey Zinoviev 
 alexey.zinov...@gmail.com wrote:

 Spark has a dependency on json4s 3.2.10, but this version has several
 bugs and I need to use 3.2.11. I added json4s-native 3.2.11 dependency to
 build.sbt and everything compiled fine. But when I spark-submit my JAR it
 provides me with 3.2.10.


 build.sbt

 import sbt.Keys._

  name := "sparkapp"

  version := "1.0"

  scalaVersion := "2.10.4"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

  libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11"


  plugins.sbt

  logLevel := Level.Warn

  resolvers += Resolver.url("artifactory",
    url("http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)

  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")


  App1.scala

  import org.apache.spark.SparkConf
  import org.apache.spark.rdd.RDD
  import org.apache.spark.{Logging, SparkConf, SparkContext}
  import org.apache.spark.SparkContext._

  object App1 extends Logging {
    def main(args: Array[String]) = {
      val conf = new SparkConf().setAppName("App1")
      val sc = new SparkContext(conf)
      println(s"json4s version: ${org.json4s.BuildInfo.version.toString}")
    }
  }



 sbt 0.13.7, sbt-assembly 0.13.0, Scala 2.10.4

 Is it possible to force 3.2.11 version usage?

 Thanks,
 Alexey





Re: Is it possible to use json4s 3.2.11 with Spark 1.3.0?

2015-03-23 Thread Alexey Zinoviev
Thanks Marcelo, these options solved the problem (I'm using 1.3.0), but it
works only if I remove extends Logging from the object; with extends
Logging it returns:

Exception in thread main java.lang.LinkageError: loader constraint
violation in interface itable initialization: when resolving method
App1$.logInfo(Lscala/Function0;Ljava/lang/Throwable;)V the class loader
(instance of org/apache/spark/util/ChildFirstURLClassLoader) of the current
class, App1$, and the class loader (instance of
sun/misc/Launcher$AppClassLoader) for interface org/apache/spark/Logging
have different Class objects for the type scala/Function0 used in the
signature
at App1.main(App1.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Do you have any idea what's wrong with Logging?

PS: I'm running it with spark-1.3.0/bin/spark-submit --class App1 --conf
spark.driver.userClassPathFirst=true --conf
spark.executor.userClassPathFirst=true
$HOME/projects/sparkapp/target/scala-2.10/sparkapp-assembly-1.0.jar

Thanks,
Alexey


On Tue, Mar 24, 2015 at 5:03 AM, Marcelo Vanzin van...@cloudera.com wrote:

 You could build a fat jar for your application containing both your
 code and the json4s library, and then run Spark with these two
 options:

   spark.driver.userClassPathFirst=true
   spark.executor.userClassPathFirst=true

 Both only work in 1.3. (1.2 has spark.files.userClassPathFirst, but
 that only works for executors.)


 On Mon, Mar 23, 2015 at 2:12 PM, Alexey Zinoviev
 alexey.zinov...@gmail.com wrote:
  Spark has a dependency on json4s 3.2.10, but this version has several
 bugs
  and I need to use 3.2.11. I added json4s-native 3.2.11 dependency to
  build.sbt and everything compiled fine. But when I spark-submit my JAR it
  provides me with 3.2.10.
 
 
  build.sbt
 
  import sbt.Keys._
 
   name := "sparkapp"

   version := "1.0"

   scalaVersion := "2.10.4"

   libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

   libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11"


   plugins.sbt

   logLevel := Level.Warn

   resolvers += Resolver.url("artifactory",
     url("http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)

   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")


   App1.scala

   import org.apache.spark.SparkConf
   import org.apache.spark.rdd.RDD
   import org.apache.spark.{Logging, SparkConf, SparkContext}
   import org.apache.spark.SparkContext._

   object App1 extends Logging {
     def main(args: Array[String]) = {
       val conf = new SparkConf().setAppName("App1")
       val sc = new SparkContext(conf)
       println(s"json4s version: ${org.json4s.BuildInfo.version.toString}")
     }
   }
 
 
 
  sbt 0.13.7, sbt-assembly 0.13.0, Scala 2.10.4
 
  Is it possible to force 3.2.11 version usage?
 
  Thanks,
  Alexey



 --
 Marcelo



Is it possible to use json4s 3.2.11 with Spark 1.3.0?

2015-03-23 Thread Alexey Zinoviev
Spark has a dependency on json4s 3.2.10, but this version has several bugs
and I need to use 3.2.11. I added json4s-native 3.2.11 dependency to
build.sbt and everything compiled fine. But when I spark-submit my JAR it
provides me with 3.2.10.


build.sbt

import sbt.Keys._

name := "sparkapp"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11"


plugins.sbt

logLevel := Level.Warn

resolvers += Resolver.url("artifactory",
  url("http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")


App1.scala

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.{Logging, SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object App1 extends Logging {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setAppName("App1")
    val sc = new SparkContext(conf)
    println(s"json4s version: ${org.json4s.BuildInfo.version.toString}")
  }
}



sbt 0.13.7, sbt-assembly 0.13.0, Scala 2.10.4

Is it possible to force 3.2.11 version usage?

Thanks,
Alexey


Re: Mathematical functions in spark sql

2015-01-26 Thread Alexey Romanchuk
I have tried select ceil(2/3), but got key not found: floor
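
A cast-based alternative might avoid floor/ceil entirely; this is just a sketch and an 
assumption on my part (a Spark 1.2 SQLContext named sqlContext):

// first column forced to a double result, second truncated to an integer
val rows = sqlContext.sql("SELECT CAST(2 AS DOUBLE) / 3, CAST(2 / 3 AS INT)")
rows.collect().foreach(println)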

On Tue, Jan 27, 2015 at 11:05 AM, Ted Yu yuzhih...@gmail.com wrote:

 Have you tried floor() or ceil() functions ?

 According to http://spark.apache.org/sql/, Spark SQL is compatible with
 Hive SQL.

 Cheers

 On Mon, Jan 26, 2015 at 8:29 PM, 1esha alexey.romanc...@gmail.com wrote:

 Hello everyone!

 I try execute select 2/3 and I get 0.. Is there any
 way
 to cast double to int or something similar?

 Also it will be cool to get list of functions supported by spark sql.

 Thanks!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Mathematical-functions-in-spark-sql-tp21383.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2014-12-02 Thread Alexey Romanchuk
Any ideas? Anyone got the same error?

On Mon, Dec 1, 2014 at 2:37 PM, Alexey Romanchuk alexey.romanc...@gmail.com
 wrote:

 Hello spark users!

 I found lots of strange messages in driver log. Here it is:

 2014-12-01 11:54:23,849 [sparkDriver-akka.actor.default-dispatcher-25]
 ERROR
 akka.remote.EndpointWriter[akka://sparkDriver/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FsparkExecutor%40data1.hadoop%3A17372-5/endpointWriter]
 - AssociationError [akka.tcp://sparkDriver@10.54.87.173:55034] -
 [akka.tcp://sparkExecutor@data1.hadoop:17372]: Error [Shut down address:
 akka.tcp://sparkExecutor@data1.hadoop:17372] [
 akka.remote.ShutDownAssociation: Shut down address:
 akka.tcp://sparkExecutor@data1.hadoop:17372
 Caused by: akka.remote.transport.Transport$InvalidAssociationException:
 The remote system terminated the association because it is shutting down.
 ]

 I got this message for every worker twice. First - for driverPropsFetcher
 and next for sparkExecutor. Looks like spark shutdown remote akka system
 incorrectly or there is some race condition in this process and driver sent
 some data to worker, but worker's actor system already in shutdown state.

 Except for this message everything works fine. But this is ERROR level
 message and I found it in my ERROR only log.

 Do you have any idea is it configuration issue, bug in spark or akka or
 something else?

 Thanks!




akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2014-12-01 Thread Alexey Romanchuk
Hello spark users!

I found lots of strange messages in driver log. Here it is:

2014-12-01 11:54:23,849 [sparkDriver-akka.actor.default-dispatcher-25]
ERROR
akka.remote.EndpointWriter[akka://sparkDriver/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FsparkExecutor%40data1.hadoop%3A17372-5/endpointWriter]
- AssociationError [akka.tcp://sparkDriver@10.54.87.173:55034] -
[akka.tcp://sparkExecutor@data1.hadoop:17372]: Error [Shut down address:
akka.tcp://sparkExecutor@data1.hadoop:17372] [
akka.remote.ShutDownAssociation: Shut down address:
akka.tcp://sparkExecutor@data1.hadoop:17372
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The
remote system terminated the association because it is shutting down.
]

I got this message twice for every worker: first for driverPropsFetcher
and then for sparkExecutor. It looks like spark shuts down the remote akka system
incorrectly, or there is some race condition in this process and the driver sent
some data to a worker whose actor system was already in a shutdown state.

Except for this message everything works fine. But this is an ERROR-level
message and I found it in my ERROR-only log.

Do you have any idea is it configuration issue, bug in spark or akka or
something else?

Thanks!


Delayed hotspot optimizations in Spark

2014-10-10 Thread Alexey Romanchuk
Hello spark users and developers!

I am using hdfs + spark sql + hive schema + parquet as the storage format. I
have a lot of parquet files - one file fits one hdfs block for one day. The
strange thing is the very slow first query for spark sql.

To reproduce the situation I use only one core, and I get 97 sec for the first query
and only 13 sec for all subsequent queries. Sure, I query for different data, but
it has the same structure and size. The situation can be reproduced after
restarting the thrift server.

Here it information about parquet files reading from worker node:

Slow one:
Oct 10, 2014 2:26:53 PM INFO: parquet.hadoop.InternalParquetRecordReader:
Assembled and processed 1560251 records from 30 columns in 11686 ms:
133.51454 rec/ms, 4005.4363 cell/ms

Fast one:
Oct 10, 2014 2:31:30 PM INFO: parquet.hadoop.InternalParquetRecordReader:
Assembled and processed 1568899 records from 1 columns in 1373 ms:
1142.6796 rec/ms, 1142.6796 cell/ms

As you can see, the second read is 10x faster than the first. Most of the
query time is spent working with the parquet file.

This problem is really annoying, because most of my spark tasks contain
just one sql query plus data processing, and to speed up my jobs I put a special
warmup query in front of every job.
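
Roughly what I mean by a warmup query (table and column names are made up; a sketch only, 
assuming a SQLContext called sqlContext):

def warmUp(sqlContext: org.apache.spark.sql.SQLContext): Unit = {
  // touch one small partition with all columns so the parquet read path is exercised
  // (and hopefully JIT-compiled) before the job's real query runs
  sqlContext.sql("SELECT * FROM events WHERE day = '2014-10-09' LIMIT 1000").collect()
}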

My assumption is that it is hotspot optimizations kicking in during the first
read. Do you have any idea how to confirm/solve this performance problem?

Thanks for any advice!

p.s. I have billions of hotspot optimizations shown with -XX:+PrintCompilation
but cannot figure out which are important and which are not.


Re: Delayed hotspot optimizations in Spark

2014-10-10 Thread Alexey Romanchuk
Hey Sean and spark users!

Thanks for the reply. I tried -Xcomp just now and the start time was about a few
minutes (as expected), but the first query is as slow as before:
Oct 10, 2014 3:03:41 PM INFO: parquet.hadoop.InternalParquetRecordReader:
Assembled and processed 1568899 records from 30 columns in 12897 ms:
121.64837 rec/ms, 3649.451 cell/ms

and next

Oct 10, 2014 3:05:03 PM INFO: parquet.hadoop.InternalParquetRecordReader:
Assembled and processed 1568899 records from 1 columns in 1757 ms:
892.94196 rec/ms, 892.94196 cell/ms

I have no idea about caching or other stuff, because the CPU load is 100% on the
worker and jstack shows that the worker is reading from the parquet file.

Any ideas?

Thanks!

On Fri, Oct 10, 2014 at 2:55 PM, Sean Owen so...@cloudera.com wrote:

 You could try setting -Xcomp for executors to force JIT compilation
 upfront. I don't know if it's a good idea overall but might show
 whether the upfront compilation really helps. I doubt it.

 However is this almost surely due to caching somewhere, in Spark SQL
 or HDFS? I really doubt hotspot makes a difference compared to these
 much larger factors.

 On Fri, Oct 10, 2014 at 8:49 AM, Alexey Romanchuk
 alexey.romanc...@gmail.com wrote:
  Hello spark users and developers!
 
  I am using hdfs + spark sql + hive schema + parquet as storage format. I
  have lot of parquet files - one files fits one hdfs block for one day.
 The
  strange thing is very slow first query for spark sql.
 
  To reproduce situation I use only one core and I have 97sec for first
 time
  and only 13sec for all next queries. Sure I query for different data,
 but it
  has same structure and size. The situation can be reproduced after
 restart
  thrift server.
 
  Here it information about parquet files reading from worker node:
 
  Slow one:
  Oct 10, 2014 2:26:53 PM INFO: parquet.hadoop.InternalParquetRecordReader:
  Assembled and processed 1560251 records from 30 columns in 11686 ms:
  133.51454 rec/ms, 4005.4363 cell/ms
 
  Fast one:
  Oct 10, 2014 2:31:30 PM INFO: parquet.hadoop.InternalParquetRecordReader:
  Assembled and processed 1568899 records from 1 columns in 1373 ms:
 1142.6796
  rec/ms, 1142.6796 cell/ms
 
  As you can see second reading is 10x times faster then first. Most of the
  query time spent to work with parquet file.
 
  This problem is really annoying, because most of my spark task contains
 just
  1 sql query and data processing and to speedup my jobs I put special
 warmup
  query in from of any job.
 
  My assumption is that it is hotspot optimizations that used due first
  reading. Do you have any idea how to confirm/solve this performance
 problem?
 
  Thanks for advice!
 
  p.s. I have billion hotspot optimization showed with
 -XX:+PrintCompilation
  but can not figure out what are important and what are not.



Re: Log hdfs blocks sending

2014-09-26 Thread Alexey Romanchuk
Hello Andrew!

Thanks for the reply. Which logs and at what level should I check? Driver,
master or worker?

I found this on the master node, but there is only an ANY locality requirement.
Here it is the driver (spark sql) log -
https://gist.github.com/13h3r/c91034307caa33139001 and one of the workers
log - https://gist.github.com/13h3r/6e5053cf0dbe33f2

Do you have any idea where to look at?

Thanks!

On Fri, Sep 26, 2014 at 10:35 AM, Andrew Ash and...@andrewash.com wrote:

 Hi Alexey,

 You should see in the logs a locality measure like NODE_LOCAL,
 PROCESS_LOCAL, ANY, etc.  If your Spark workers each have an HDFS data node
 on them and you're reading out of HDFS, then you should be seeing almost
 all NODE_LOCAL accesses.  One cause I've seen for mismatches is if Spark
 uses short hostnames and Hadoop uses FQDNs -- in that case Spark doesn't
 think the data is local and does remote reads which really kills
 performance.

 Hope that helps!
 Andrew

 On Thu, Sep 25, 2014 at 12:09 AM, Alexey Romanchuk 
 alexey.romanc...@gmail.com wrote:

 Hello again spark users and developers!

 I have standalone spark cluster (1.1.0) and spark sql running on it. My
 cluster consists of 4 datanodes and replication factor of files is 3.

 I use thrift server to access spark sql and have 1 table with 30+
 partitions. When I run query on whole table (something simple like select
 count(*) from t) spark produces a lot of network activity filling all
 available 1gb link. Looks like spark sent data by network instead of local
 reading.

 Is it any way to log which blocks were accessed locally and which are not?

 Thanks!





Log hdfs blocks sending

2014-09-25 Thread Alexey Romanchuk
Hello again spark users and developers!

I have a standalone spark cluster (1.1.0) with spark sql running on it. My
cluster consists of 4 datanodes and the replication factor of files is 3.

I use the thrift server to access spark sql and have 1 table with 30+
partitions. When I run a query on the whole table (something simple like select
count(*) from t), spark produces a lot of network activity, filling all the
available 1gb link. It looks like spark sends data over the network instead of
reading locally.

Is there any way to log which blocks were accessed locally and which were not?

Thanks!