unsubscribe

2016-05-27 Thread Rakesh H (Marketing Platform-BLR)



RE: GraphX Java API

2016-05-27 Thread Santoshakhilesh
GraphX APIs are available only in Scala. If you need to use GraphX, you need to
switch to Scala.
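
For example, a minimal Scala sketch (assuming an existing SparkContext named sc):

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows")))
val graph    = Graph(vertices, edges)  // build the property graph
println(graph.numVertices)             // 2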

From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 27 May 2016 19:59
To: user@spark.apache.org
Subject: GraphX Java API

Hi,

We are trying to consume the Java API for GraphX, but there is no documentation 
available online on the usage or examples. It would be great if we could get 
some examples in Java.

Thanks and regards,

Abhishek Kumar
Products & Services | iLab
Deloitte Consulting LLP
Block ‘C’, Divyasree Technopolis, Survey No.: 123 & 132/2, Yemlur Post, Yemlur, 
Bengaluru – 560037, Karnataka, India
Mobile: +91 7736795770
abhishekkuma...@deloitte.com | 
www.deloitte.com

Please consider the environment before printing.







Re: Pros and Cons

2016-05-27 Thread Mich Talebzadeh
Hi Teng,


What version of Spark are you using as the execution engine? Are you using a
vendor's product here?

Thanks

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 27 May 2016 at 13:05, Teng Qiu  wrote:

> I agree with Koert and Reynold, spark works well with large dataset now.
>
> back to the original discussion, compare SparkSQL vs Hive in Spark vs
> Spark API.
>
> SparkSQL vs Spark API you can simply imagine you are in RDBMS world,
> SparkSQL is pure SQL, and Spark API is language for writing stored
> procedure
>
> Hive on Spark is similar to SparkSQL, it is a pure SQL interface that
> use spark as spark as execution engine, SparkSQL uses Hive's syntax,
> so as a language, i would say they are almost the same.
>
> but Hive on Spark has a much better support for hive features,
> especially hiveserver2 and security features, hive features in
> SparkSQL is really buggy, there is a hiveserver2 impl in SparkSQL, but
> in latest release version (1.6.x), hiveserver2 in SparkSQL doesn't
> work with hivevar and hiveconf argument anymore, and the username for
> login via jdbc doesn't work either...
> see https://issues.apache.org/jira/browse/SPARK-13983
>
> i believe hive support in spark project is really very low priority
> stuff...
>
> sadly Hive on spark integration is not that easy, there are a lot of
> dependency conflicts... such as
> https://issues.apache.org/jira/browse/HIVE-13301
>
> our requirement is using spark with hiveserver2 in a secure way (with
> authentication and authorization), currently SparkSQL alone can not
> provide this, we are using ranger/sentry + Hive on Spark.
>
> hope this can help you to get a better idea which direction you should go.
>
> Cheers,
>
> Teng
>
>
> 2016-05-27 2:36 GMT+02:00 Koert Kuipers :
> > We do disk-to-disk iterative algorithms in spark all the time, on
> datasets
> > that do not fit in memory, and it works well for us. I usually have to do
> > some tuning of number of partitions for a new dataset but that's about
> it in
> > terms of inconveniences.
> >
> > On May 26, 2016 2:07 AM, "Jörn Franke"  wrote:
> >
> >
> > Spark can handle this true, but it is optimized for the idea that it
> works
> > it works on the same full dataset in-memory due to the underlying nature
> of
> > machine learning algorithms (iterative). Of course, you can spill over,
> but
> > that you should avoid.
> >
> > That being said you should have read my final sentence about this. Both
> > systems develop and change.
> >
> >
> > On 25 May 2016, at 22:14, Reynold Xin  wrote:
> >
> >
> > On Wed, May 25, 2016 at 9:52 AM, Jörn Franke 
> wrote:
> >>
> >> Spark is more for machine learning working iteravely over the whole same
> >> dataset in memory. Additionally it has streaming and graph processing
> >> capabilities that can be used together.
> >
> >
> > Hi Jörn,
> >
> > The first part is actually no true. Spark can handle data far greater
> than
> > the aggregate memory available on a cluster. The more recent versions
> (1.3+)
> > of Spark have external operations for almost all built-in operators, and
> > while things may not be perfect, those external operators are becoming
> more
> > and more robust with each version of Spark.
> >
> >
> >
> >
> >
>


Re: JDBC Dialect for saving DataFrame into Vertica Table

2016-05-27 Thread Aaron Ilovici
Mohammed,

The Spark Connector for Vertica is still in beta, and while that is still an
option, I would prefer native support from Spark. Considering all data types
seem to map with the aggregated dialect except for NULL types, I imagine the
work involved would be relatively minimal. I would be happy to code it out and
submit a pull request, but I have a question about the dialect:


-  Are NULL data types implicitly defined somewhere? I don’t see NULL 
cases in the other dialects.

I have come up with answers to the other questions below, and found 
Java->Vertica data type conversions. The only piece I am missing is the NULL 
value, which is the root of the necessity to have this dialect in the first 
place.
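
For reference, here is a rough sketch of the kind of dialect I have in mind,
written against the Spark 1.6 JdbcDialect API (the type mappings are
illustrative guesses, not verified against Vertica):

import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

object VerticaDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:vertica")

  // JdbcType carries both the column definition and the java.sql.Types code
  // that Spark passes to PreparedStatement.setNull for NULL values.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("VARCHAR(65000)", Types.VARCHAR))
    case BooleanType => Some(JdbcType("BOOLEAN", Types.BOOLEAN))
    case _           => None  // fall back to Spark's common defaults
  }
}

JdbcDialects.registerDialect(VerticaDialect)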

Reynold,

I agree. Vertica seems to have shipped an MVP at GA and then made it performant
in later releases, so I doubt the performance loss would be drastic.

Thank you for any help,


AARON ILOVICI
Software Engineer
Marketing Engineering


WAYFAIR
4 Copley Place
Boston, MA 02116
(617) 532-6100 x1231
ailov...@wayfair.com



From: Reynold Xin 
Date: Thursday, May 26, 2016 at 6:11 PM
To: Mohammed Guller 
Cc: Aaron Ilovici , "user@spark.apache.org" 
, "d...@spark.apache.org" 
Subject: Re: JDBC Dialect for saving DataFrame into Vertica Table

It's probably a good idea to have the vertica dialect too, since it doesn't 
seem like it'd be too difficult to maintain. It is not going to be as 
performant as the native Vertica data source, but is going to be much lighter 
weight.


On Thu, May 26, 2016 at 3:09 PM, Mohammed Guller 
> wrote:
Vertica also provides a Spark connector. It was not GA the last time I looked 
at it, but available on the Vertica community site. Have you tried using the 
Vertica Spark connector instead of the JDBC driver?

Mohammed
Author: Big Data Analytics with 
Spark

From: Aaron Ilovici [mailto:ailov...@wayfair.com]
Sent: Thursday, May 26, 2016 8:08 AM
To: user@spark.apache.org; 
d...@spark.apache.org
Subject: JDBC Dialect for saving DataFrame into Vertica Table

I am attempting to write a DataFrame of Rows to Vertica via DataFrameWriter's 
jdbc function in the following manner:

dataframe.write().mode(SaveMode.Append).jdbc(url, table, properties);

This works when there are no NULL values in any of the Rows in my DataFrame.
However, when rows contain NULL values, I get the following error:

ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 24)
java.sql.SQLFeatureNotSupportedException: [Vertica][JDBC](10220) Driver not 
capable.
at com.vertica.exceptions.ExceptionConverter.toSQLException(Unknown Source)
at com.vertica.jdbc.common.SPreparedStatement.checkTypeSupported(Unknown 
Source)
at com.vertica.jdbc.common.SPreparedStatement.setNull(Unknown Source)

This appears to be Spark's attempt to set a null value in a PreparedStatement, 
but Vertica does not understand the type upon executing the transaction. I see 
in JdbcDialects.scala that there are dialects for MySQL, Postgres, DB2, 
MsSQLServer, Derby, and Oracle.

1 - Would writing a dialect for Vertica alleviate the issue, by setting a 'NULL'
in a type that Vertica would understand?
2 - What would be the best way to do this without a Spark patch? Scala, Java, 
make a jar and call 'JdbcDialects.registerDialect(VerticaDialect)' once created?
3 - Where would one find the proper mapping between Spark DataTypes and Vertica 
DataTypes? I don't see 'NULL' handling for any of the dialects, only the base 
case 'case _ => None' - is None mapped to the proper NULL type elsewhere?

My environment: Spark 1.6, Vertica Driver 7.2.2, Java 1.7

I would be happy to create a Jira and submit a pull request with the 
VerticaDialect once I figure this out.

Thank you for any insight on this,

AARON ILOVICI
Software Engineer
Marketing Engineering


WAYFAIR
4 Copley Place
Boston, MA 02116
(617) 532-6100 x1231
ailov...@wayfair.com





Python memory included YARN-monitored memory?

2016-05-27 Thread Mike Sukmanowsky
Hi everyone,

More of a YARN/OS question than a Spark one, but it would be good to clarify
this in the docs somewhere once I get an answer.

We use PySpark for all our Spark applications running on EMR. Like many
users, we're accustomed to seeing the occasional ExecutorLostFailure after
YARN kills a container using more memory than it was allocated.

We're beginning to tune spark.yarn.executor.memoryOverhead, but before
messing around with that I wanted to check if YARN is monitoring the memory
usage of both the executor JVM and the spawned pyspark.daemon process or
just the JVM? Inspecting things on one of the YARN nodes would seem to
indicate this isn't the case since the spawned daemon gets a separate
process ID and process group, but I wanted to check to confirm as it could
make a big difference to pyspark users hoping to tune things.

Thanks,
Mike


Re: Pros and Cons

2016-05-27 Thread Teng Qiu
Ah, yes, the version is another mess!... no vendor's product.

I tried Hadoop 2.6.2, Hive 1.2.1 with Spark 1.6.1; it doesn't work.

Hadoop 2.6.2, Hive 2.0.1 with Spark 1.6.1 works, but this needs to be fixed
from the Hive side: https://issues.apache.org/jira/browse/HIVE-13301

The jackson-databind lib from calcite-avatica.jar is too old.

I will try Hadoop 2.7, Hive 2.0.1 and Spark 2.0.0 when Spark 2.0.0 is released.


2016-05-27 16:16 GMT+02:00 Mich Talebzadeh :
> Hi Teng,
>
>
> what version of spark are using as the execution engine. are you using a
> vendor's product here?
>
> thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
> On 27 May 2016 at 13:05, Teng Qiu  wrote:
>>
>> I agree with Koert and Reynold, spark works well with large dataset now.
>>
>> back to the original discussion, compare SparkSQL vs Hive in Spark vs
>> Spark API.
>>
>> SparkSQL vs Spark API you can simply imagine you are in RDBMS world,
>> SparkSQL is pure SQL, and Spark API is language for writing stored
>> procedure
>>
>> Hive on Spark is similar to SparkSQL, it is a pure SQL interface that
>> use spark as spark as execution engine, SparkSQL uses Hive's syntax,
>> so as a language, i would say they are almost the same.
>>
>> but Hive on Spark has a much better support for hive features,
>> especially hiveserver2 and security features, hive features in
>> SparkSQL is really buggy, there is a hiveserver2 impl in SparkSQL, but
>> in latest release version (1.6.x), hiveserver2 in SparkSQL doesn't
>> work with hivevar and hiveconf argument anymore, and the username for
>> login via jdbc doesn't work either...
>> see https://issues.apache.org/jira/browse/SPARK-13983
>>
>> i believe hive support in spark project is really very low priority
>> stuff...
>>
>> sadly Hive on spark integration is not that easy, there are a lot of
>> dependency conflicts... such as
>> https://issues.apache.org/jira/browse/HIVE-13301
>>
>> our requirement is using spark with hiveserver2 in a secure way (with
>> authentication and authorization), currently SparkSQL alone can not
>> provide this, we are using ranger/sentry + Hive on Spark.
>>
>> hope this can help you to get a better idea which direction you should go.
>>
>> Cheers,
>>
>> Teng
>>
>>
>> 2016-05-27 2:36 GMT+02:00 Koert Kuipers :
>> > We do disk-to-disk iterative algorithms in spark all the time, on
>> > datasets
>> > that do not fit in memory, and it works well for us. I usually have to
>> > do
>> > some tuning of number of partitions for a new dataset but that's about
>> > it in
>> > terms of inconveniences.
>> >
>> > On May 26, 2016 2:07 AM, "Jörn Franke"  wrote:
>> >
>> >
>> > Spark can handle this true, but it is optimized for the idea that it
>> > works
>> > it works on the same full dataset in-memory due to the underlying nature
>> > of
>> > machine learning algorithms (iterative). Of course, you can spill over,
>> > but
>> > that you should avoid.
>> >
>> > That being said you should have read my final sentence about this. Both
>> > systems develop and change.
>> >
>> >
>> > On 25 May 2016, at 22:14, Reynold Xin  wrote:
>> >
>> >
>> > On Wed, May 25, 2016 at 9:52 AM, Jörn Franke 
>> > wrote:
>> >>
>> >> Spark is more for machine learning working iteravely over the whole
>> >> same
>> >> dataset in memory. Additionally it has streaming and graph processing
>> >> capabilities that can be used together.
>> >
>> >
>> > Hi Jörn,
>> >
>> > The first part is actually no true. Spark can handle data far greater
>> > than
>> > the aggregate memory available on a cluster. The more recent versions
>> > (1.3+)
>> > of Spark have external operations for almost all built-in operators, and
>> > while things may not be perfect, those external operators are becoming
>> > more
>> > and more robust with each version of Spark.
>> >
>> >
>> >
>> >
>> >
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



JDBC Create Table

2016-05-27 Thread Andrés Ivaldi
Hello, yesterday I updated Spark 1.6.0 to 1.6.1 and my tests started to fail
because it is no longer possible to create new tables in SQL Server. I'm using
SaveMode.Overwrite, as in the 1.6.0 version.
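
For reference, this is the pattern I mean (a minimal sketch; df is any
DataFrame, and the connection details and table name are placeholders):

import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.setProperty("user", "...")
props.setProperty("password", "...")

// worked in 1.6.0; in 1.6.1 the table is no longer created
df.write.mode(SaveMode.Overwrite)
  .jdbc("jdbc:sqlserver://host:1433;databaseName=mydb", "dbo.my_table", props)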

Any ideas?

regards

-- 
Ing. Ivaldi Andres


Re: Possible bug involving Vectors with a single element

2016-05-27 Thread Yanbo Liang
Spark MLlib Vector only supports data of double type, so it's reasonable to
throw an exception when you create a Vector with an element of unicode type.

2016-05-24 7:27 GMT-07:00 flyinggip :

> Hi there,
>
> I notice that there might be a bug in pyspark.mllib.linalg.Vectors when
> dealing with a vector with a single element.
>
> Firstly, the 'dense' method says it can also take numpy.array. However the
> code uses 'if len(elements) == 1' and when a numpy.array has only one
> element its length is undefined and currently if calling dense() on a numpy
> array with one element the program crashes. Probably instead of using len()
> in the above if, size should be used.
>
> Secondly, after I managed to create a dense-Vectors object with only one
> element from unicode, it seems that its behaviour is unpredictable. For
> example,
>
> Vectors.dense(unicode("0.1"))
>
> will report an error.
>
> dense_vec = Vectors.dense(unicode("0.1"))
>
> will NOT report any error until you run
>
> dense_vec
>
> to check its value. And the following will be able to create a successful
> DataFrame:
>
> mylist = [(0, Vectors.dense(unicode("0.1")))]
> myrdd = sc.parallelize(mylist)
> mydf = sqlContext.createDataFrame(myrdd, ["X", "Y"])
>
> However if the above unicode value is read from a text file (e.g., a csv
> file with 2 columns) then the DataFrame column corresponding to "Y" will be
> EMPTY:
>
> raw_data = sc.textFile(filename)
> split_data = raw_data.map(lambda line: line.split(','))
> parsed_data = split_data.map(lambda line: (int(line[0]),
> Vectors.dense(line[1])))
> mydf = sqlContext.createDataFrame(parsed_data, ["X", "Y"])
>
> It would be great if someone could share some ideas. Thanks a lot.
>
> f.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Possible-bug-involving-Vectors-with-a-single-element-tp27013.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Logistic Regression in Spark Streaming

2016-05-27 Thread kundan kumar
Agreed, we have a logistic regression example.

I was looking for its counterpart to "StreamingLinearRegressionWithSGD".
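
(For reference, MLlib does appear to include a streaming counterpart,
org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD; a
minimal sketch, where trainingStream and testStream are assumed
DStream[LabeledPoint] inputs and the feature count is a placeholder:)

import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

val numFeatures = 100  // assumed feature count for the CTR model

val model = new StreamingLogisticRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))  // start from zero weights

model.trainOn(trainingStream)   // the model keeps updating as batches arrive
model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()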

On Fri, May 27, 2016 at 1:16 PM, Alonso Isidoro Roman 
wrote:

> I do not have any experience using LR in spark, but you can see that LR is
> already implemented in mllib.
>
> http://spark.apache.org/docs/latest/mllib-linear-methods.html
>
>
>
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
>
> 
>
> 2016-05-27 9:09 GMT+02:00 kundan kumar :
>
>> Hi ,
>>
>> Do we have a streaming version of Logistic Regression in Spark ? I can
>> see its there for the Linear Regression.
>>
>> Has anyone used logistic regression on streaming data, it would be really
>> helpful if you share your insights on how to train the incoming data.
>>
>> In my use case I am trying to use logistic regression for click through
>> rate prediction using spark. Reason to go for online streaming mode is we
>> have new advertisers and items coming and old items leaving.
>>
>> Any insights would be helpful.
>>
>>
>> Regards,
>> Kundan
>>
>>
>


unsubscribe

2016-05-27 Thread Vinoth Sankar
unsubscribe


DIMSUM among 550k objects on AWS Elastic Map Reduce fails with OOM errors

2016-05-27 Thread nmoretto
Hello everyone, I am trying to compute the similarity between 550k objects
using the DIMSUM algorithm available in Spark 1.6.

The cluster runs on AWS Elastic Map Reduce and consists of 6 r3.2xlarge
instances (one master and five cores), having 8 vCPU and 61 GiB of RAM each.

My input data is a 3.5GB CSV file hosted on AWS S3, which I use to build a
RowMatrix with 550k columns and 550k rows, passing sparse vectors as rows to
the RowMatrix constructor.
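
For context, the computation boils down to something like this sketch (rows
being the assumed RDD[Vector] of sparse rows built from the CSV):

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val matrix = new RowMatrix(rows)                   // 550k columns, sparse rows
val similarities = matrix.columnSimilarities(0.5)  // DIMSUM with threshold 0.5
similarities.entries.take(10).foreach(println)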

At every attempt I've made so far, the application fails during the
mapPartitionsWithIndex stage of the RowMatrix.columnSimilarities() method
(source code at
https://github.com/apache/spark/blob/v1.6.0/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L587)
with YARN containers either 1) exiting with FAILURE due to an OutOfMemory
exception on the Java heap space (thrown by Spark, apparently) or 2) being
terminated by the AM (and increasing spark.yarn.executor.memoryOverhead as
suggested doesn't seem to help).

I tried and combined different approaches without noticing significant
improvements:
- setting the AWS EMR maximizeResourceAllocation option to true (details at
https://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html)
- increasing the number of partitions (via spark.default.parallelism, up to 8000)
- increasing the driver and executor memory (respectively from the defaults of
~512M / ~5G to ~50G / ~15G)
- increasing the YARN memory overhead (from the default 10% up to 40% of driver
and executor memory, respectively)
- setting the DIMSUM threshold to 0.5 and 0.8 to reduce the number of comparisons

Anyone has any idea about the possible cause(s) of these errors? Thank you.




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



GraphX Java API

2016-05-27 Thread Kumar, Abhishek (US - Bengaluru)
Hi,

We are trying to consume the Java API for GraphX, but there is no documentation 
available online on the usage or examples. It would be great if we could get 
some examples in Java.

Thanks and regards,

Abhishek Kumar
Products & Services | iLab
Deloitte Consulting LLP
Block ‘C’, Divyasree Technopolis, Survey No.: 123 & 132/2, Yemlur Post, Yemlur, 
Bengaluru – 560037, Karnataka, India
Mobile: +91 7736795770
abhishekkuma...@deloitte.com | 
www.deloitte.com

Please consider the environment before printing.






pyspark.GroupedData.agg works incorrectly when one column is aggregated twice?

2016-05-27 Thread Andrew Vykhodtsev
Dear list,

I am trying to calculate sum and count on the same column:

user_id_books_clicks =
(sqlContext.read.parquet('hdfs:///projects/kaggle-expedia/input/train.parquet')
  .groupby('user_id')
  .agg({'is_booking':'count',
'is_booking':'sum'})
  .orderBy(fn.desc('count(user_id)'))
  .cache()
   )

If I do it like that, it only gives me one (last) aggregate -
sum(is_booking)

But if I change to .agg({'user_id':'count', 'is_booking':'sum'})  -  it
gives me both. I am on 1.6.1. Is it fixed in 2.+? Or should I report it to
JIRA?


problem about RDD map and then saveAsTextFile

2016-05-27 Thread Reminia Scarlet
Hi all:
 I've tried to execute something like the following:

 result.map(transform).saveAsTextFile(hdfsAddress)

 result is an RDD calculated by an MLlib algorithm.


I submit this to YARN, and after two attempts the application failed.

But the exception in the log is very misleading. It said hdfsAddress already
exists.

Actually, the log of the first attempt showed that the exception came from the
calculation of result. Although that attempt failed, it still created the
output path, and then attempt 2 began with the exception 'file already
exists'.


 Why was the output path created even though the RDD calculation had already
failed? That's not so good, I think.


Re: Pros and Cons

2016-05-27 Thread Teng Qiu
I agree with Koert and Reynold, Spark works well with large datasets now.

Back to the original discussion: comparing SparkSQL vs Hive on Spark vs the Spark API.

For SparkSQL vs the Spark API, you can simply imagine you are in the RDBMS
world: SparkSQL is pure SQL, and the Spark API is the language for writing
stored procedures.
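
For illustration, the same aggregation written both ways (assuming a SQLContext
named sqlContext and a registered table "employees"):

// SQL surface
val bySql = sqlContext.sql("SELECT dept, count(*) AS cnt FROM employees GROUP BY dept")

// DataFrame/Dataset API surface ("stored procedure" style)
val byApi = sqlContext.table("employees").groupBy("dept").count()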

Hive on Spark is similar to SparkSQL: it is a pure SQL interface that uses
Spark as the execution engine. SparkSQL uses Hive's syntax, so as a language I
would say they are almost the same.

But Hive on Spark has much better support for Hive features, especially
HiveServer2 and security features. The Hive features in SparkSQL are really
buggy: there is a HiveServer2 implementation in SparkSQL, but in the latest
release version (1.6.x) it no longer works with the hivevar and hiveconf
arguments, and the username for login via JDBC doesn't work either...
see https://issues.apache.org/jira/browse/SPARK-13983

I believe Hive support in the Spark project is really low-priority stuff...

Sadly, Hive on Spark integration is not that easy; there are a lot of
dependency conflicts... such as
https://issues.apache.org/jira/browse/HIVE-13301

Our requirement is using Spark with HiveServer2 in a secure way (with
authentication and authorization). Currently SparkSQL alone cannot provide
this, so we are using Ranger/Sentry + Hive on Spark.

Hope this helps you get a better idea of which direction you should go.

Cheers,

Teng


2016-05-27 2:36 GMT+02:00 Koert Kuipers :
> We do disk-to-disk iterative algorithms in spark all the time, on datasets
> that do not fit in memory, and it works well for us. I usually have to do
> some tuning of number of partitions for a new dataset but that's about it in
> terms of inconveniences.
>
> On May 26, 2016 2:07 AM, "Jörn Franke"  wrote:
>
>
> Spark can handle this true, but it is optimized for the idea that it works
> it works on the same full dataset in-memory due to the underlying nature of
> machine learning algorithms (iterative). Of course, you can spill over, but
> that you should avoid.
>
> That being said you should have read my final sentence about this. Both
> systems develop and change.
>
>
> On 25 May 2016, at 22:14, Reynold Xin  wrote:
>
>
> On Wed, May 25, 2016 at 9:52 AM, Jörn Franke  wrote:
>>
>> Spark is more for machine learning working iteravely over the whole same
>> dataset in memory. Additionally it has streaming and graph processing
>> capabilities that can be used together.
>
>
> Hi Jörn,
>
> The first part is actually no true. Spark can handle data far greater than
> the aggregate memory available on a cluster. The more recent versions (1.3+)
> of Spark have external operations for almost all built-in operators, and
> while things may not be perfect, those external operators are becoming more
> and more robust with each version of Spark.
>
>
>
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: problem about RDD map and then saveAsTextFile

2016-05-27 Thread Christian Hellström
Internally, saveAsTextFile uses saveAsHadoopFile:
https://github.com/apache/spark/blob/d5911d1173fe0872f21cae6c47abf8ff479345a4/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
.

The final bit in the method first creates the output path and then saves
the data set. However, if there is an issue with the saveAsHadoopDataset
call, the path still remains. Technically, we could add an
exception-handling section that removes the path in case of problems. I
think that would be a nice way of making sure that we don’t litter the FS
with empty files and directories in case of exceptions.
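
In the meantime, a small defensive sketch on the caller side (using the names
from the original mail and the standard Hadoop FileSystem API) would be to
delete a leftover directory before re-running:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// delete a leftover output directory from a failed attempt before re-running
val fs = FileSystem.get(URI.create(hdfsAddress), sc.hadoopConfiguration)
val out = new Path(hdfsAddress)
if (fs.exists(out)) fs.delete(out, true)

result.map(transform).saveAsTextFile(hdfsAddress)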

So, to your question: the parameter to saveAsTextFile is a path (not a file),
and it has to be empty. Spark automatically names the files part-N, with N the
partition number; this follows immediately from the partitioning scheme of the
RDD itself.

The real problem is that the calculation itself is failing. You might want to
fix that first. Just post the relevant bits from the log.
Hi all:
 I've tried to execute something like the following:

 result.map(transform).saveAsTextFile(hdfsAddress)

 result is an RDD calculated by an MLlib algorithm.


I submit this to YARN, and after two attempts the application failed.

But the exception in the log is very misleading. It said hdfsAddress already
exists.

Actually, the log of the first attempt showed that the exception came from the
calculation of result. Although that attempt failed, it still created the
output path, and then attempt 2 began with the exception 'file already
exists'.


 Why was the output path created even though the RDD calculation had already
failed? That's not so good, I think.


Re: JDBC Create Table

2016-05-27 Thread Anthony May
Hi Andrés,

What error are you seeing? Can you paste the stack trace?

Anthony

On Fri, 27 May 2016 at 08:37 Andrés Ivaldi  wrote:

> Hello, yesterday I updated Spark 1.6.0 to 1.6.1 and my tests starts to
> fail because is not possible create new tables in SQLServer, I'm using
> SaveMode.Overwrite as in 1.6.0 version
>
> Any Idea
>
> regards
>
> --
> Ing. Ivaldi Andres
>


I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
Unfortunately I can't show exactly the data I'm using, but this is what I'm
seeing:

I have a case class 'Product' that represents a table in our database. I
load that data via sqlContext.read.format("jdbc").options(...).load.as[Product]
and register it in a temp table 'product'.

For testing, I created a new Dataset that has only 3 records in it:

val ts = sqlContext.sql("select * from product where product_catalog_id in
(1, 2, 3)").as[Product]

I also created another one using the same case class and data, but from a
sequence instead.

val ds: Dataset[Product] = Seq(
  Product(Some(1), ...),
  Product(Some(2), ...),
  Product(Some(3), ...)
).toDS

The spark shell tells me these are exactly the same type at this point, but
they don't behave the same.

ts.as("ts1").joinWith(ts.as("ts2"), $"ts1.product_catalog_id" ===
$"ts2.product_catalog_id")
ds.as("ds1").joinWith(ds.as("ds2"), $"ds1.product_catalog_id" ===
$"ds2.product_catalog_id")

Again, spark tells me these self joins return exactly the same type, but
when I do a .show on them, only the one created from a Seq works. The one
created by reading from the database throws this error:

org.apache.spark.sql.AnalysisException: cannot resolve
'ts1.product_catalog_id' given input columns: [..., product_catalog_id,
...];

Is this a bug? Is there anyway to make the Dataset loaded from a table
behave like the one created from a sequence?

Thanks,
Tim


Spark_API_Copy_From_Edgenode

2016-05-27 Thread Ajay Chander
Hi Everyone,

   I have some data located on the edge node. Right
now, the process I follow to copy the data from the edge node to HDFS is through
a shell script which resides on the edge node. In Oozie I am using an SSH action
to execute the shell script on the edge node, which copies the data to HDFS.

  I was just wondering if there is any built-in API
within Spark to do this job. I want to read the data from the edge node into an
RDD using JavaSparkContext and then do saveAsTextFile("hdfs://..."). Does
JavaSparkContext provide any method to pass the edge node's access credentials
and read the data into an RDD?

Thank you for your valuable time. Any pointers are appreciated.

Thank You,
Aj


Re: JDBC Create Table

2016-05-27 Thread Mich Talebzadeh
Are you using JDBC in the Spark shell?

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 27 May 2016 at 17:16, Anthony May  wrote:

> Hi Andrés,
>
> What error are you seeing? Can you paste the stack trace?
>
> Anthony
>
> On Fri, 27 May 2016 at 08:37 Andrés Ivaldi  wrote:
>
>> Hello, yesterday I updated Spark 1.6.0 to 1.6.1 and my tests starts to
>> fail because is not possible create new tables in SQLServer, I'm using
>> SaveMode.Overwrite as in 1.6.0 version
>>
>> Any Idea
>>
>> regards
>>
>> --
>> Ing. Ivaldi Andres
>>
>


Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
I stand corrected. I just created a test table with a single int field to
test with and the Dataset loaded from that works with no issues. I'll see
if I can track down exactly what the difference might be.

On Fri, May 27, 2016 at 10:29 AM Tim Gautier  wrote:

> I'm using 1.6.1.
>
> I'm not sure what good fake data would do since it doesn't seem to have
> anything to do with the data itself. It has to do with how the Dataset was
> created. Both datasets have exactly the same data in them, but the one
> created from a sql query fails where the one created from a Seq works. The
> case class is just a few Option[Int] and Option[String] fields, nothing
> special.
>
> Obviously there's some sort of difference between the two datasets, but
> Spark tells me they're exactly the same type with exactly the same data, so
> I couldn't create a test case without actually accessing a sql database.
>
> On Fri, May 27, 2016 at 10:15 AM Ted Yu  wrote:
>
>> Which release of Spark are you using ?
>>
>> Is it possible to come up with fake data that shows what you described ?
>>
>> Thanks
>>
>> On Fri, May 27, 2016 at 8:24 AM, Tim Gautier 
>> wrote:
>>
>>> Unfortunately I can't show exactly the data I'm using, but this is what
>>> I'm seeing:
>>>
>>> I have a case class 'Product' that represents a table in our database. I
>>> load that data via 
>>> sqlContext.read.format("jdbc").options(...).load.as[Product]
>>> and register it in a temp table 'product'.
>>>
>>> For testing, I created a new Dataset that has only 3 records in it:
>>>
>>> val ts = sqlContext.sql("select * from product where product_catalog_id
>>> in (1, 2, 3)").as[Product]
>>>
>>> I also created another one using the same case class and data, but from
>>> a sequence instead.
>>>
>>> val ds: Dataset[Product] = Seq(
>>>   Product(Some(1), ...),
>>>   Product(Some(2), ...),
>>>   Product(Some(3), ...)
>>> ).toDS
>>>
>>> The spark shell tells me these are exactly the same type at this point,
>>> but they don't behave the same.
>>>
>>> ts.as("ts1").joinWith(ts.as("ts2"), $"ts1.product_catalog_id" ===
>>> $"ts2.product_catalog_id")
>>> ds.as("ds1").joinWith(ds.as("ds2"), $"ds1.product_catalog_id" ===
>>> $"ds2.product_catalog_id")
>>>
>>> Again, spark tells me these self joins return exactly the same type, but
>>> when I do a .show on them, only the one created from a Seq works. The one
>>> created by reading from the database throws this error:
>>>
>>> org.apache.spark.sql.AnalysisException: cannot resolve
>>> 'ts1.product_catalog_id' given input columns: [..., product_catalog_id,
>>> ...];
>>>
>>> Is this a bug? Is there anyway to make the Dataset loaded from a table
>>> behave like the one created from a sequence?
>>>
>>> Thanks,
>>> Tim
>>>
>>
>>


Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Ted Yu
I tried master branch :

scala> val testMapped = test.map(t => t.copy(id = t.id + 1))
testMapped: org.apache.spark.sql.Dataset[Test] = [id: int]

scala> testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id").show
org.apache.spark.sql.AnalysisException: cannot resolve '`t1.id`' given
input columns: [id];
  at
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:62)
  at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:59)
  at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
  at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
  at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)


Suggest logging a JIRA if there is none logged.

On Fri, May 27, 2016 at 10:19 AM, Tim Gautier  wrote:

> Oops, screwed up my example. This is what it should be:
>
> case class Test(id: Int)
> val test = Seq(
>   Test(1),
>   Test(2),
>   Test(3)
> ).toDS
> test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show
> val testMapped = test.map(t => t.copy(id = t.id + 1))
> testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id
> ").show
>
>
> On Fri, May 27, 2016 at 11:16 AM Tim Gautier 
> wrote:
>
>> I figured it out the trigger. Turns out it wasn't because I loaded it
>> from the database, it was because the first thing I do after loading is to
>> lower case all the strings. After a Dataset has been mapped, the resulting
>> Dataset can't be self joined. Here's a test case that illustrates the issue:
>>
>> case class Test(id: Int)
>> val test = Seq(
>>   Test(1),
>>   Test(2),
>>   Test(3)
>> ).toDS
>> test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show //
>> <-- works fine
>> val testMapped = test.map(_.id + 1) // add 1 to each
>> testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"
>> t2.id").show // <-- error
>>
>>
>> On Fri, May 27, 2016 at 10:44 AM Tim Gautier 
>> wrote:
>>
>>> I stand corrected. I just created a test table with a single int field
>>> to test with and the Dataset loaded from that works with no issues. I'll
>>> see if I can track down exactly what the difference might be.
>>>
>>> On Fri, May 27, 2016 at 10:29 AM Tim Gautier 
>>> wrote:
>>>
 I'm using 1.6.1.

 I'm not sure what good fake data would do since it doesn't seem to have
 anything to do with the data itself. It has to do with how the Dataset was
 created. Both datasets have exactly the same data in them, but the one
 created from a sql query fails where the one created from a Seq works. The
 case class is just a few Option[Int] and Option[String] fields, nothing
 special.

 Obviously there's some sort of difference between the two datasets, but
 Spark tells me they're exactly the same type with exactly the same data, so
 I couldn't create a test case without actually accessing a sql database.

 On Fri, May 27, 2016 at 10:15 AM Ted Yu  wrote:

> Which release of Spark are you using ?
>
> Is it possible to come up with fake data that shows what you described
> ?
>
> Thanks
>
> On Fri, May 27, 2016 at 8:24 AM, Tim Gautier 
> wrote:
>
>> Unfortunately I can't show exactly the data I'm using, but this is
>> what I'm seeing:
>>
>> I have a case class 'Product' that represents a table in our
>> database. I load that data via 
>> sqlContext.read.format("jdbc").options(...).
>> load.as[Product] and register it in a temp table 'product'.
>>
>> For testing, I created a new Dataset that has only 3 records in it:
>>
>> val ts = sqlContext.sql("select * from product where
>> product_catalog_id in (1, 2, 3)").as[Product]
>>
>> I also created another one using the same case class and data, but
>> from a sequence instead.
>>
>> val ds: Dataset[Product] = Seq(
>>   Product(Some(1), ...),
>>   Product(Some(2), ...),
>>   Product(Some(3), ...)
>> ).toDS
>>
>> The spark shell tells me these are exactly the same type at this
>> point, but they don't behave the same.
>>
>> ts.as("ts1").joinWith(ts.as("ts2"), $"ts1.product_catalog_id" ===
>> $"ts2.product_catalog_id")
>> ds.as("ds1").joinWith(ds.as("ds2"), $"ds1.product_catalog_id" ===
>> $"ds2.product_catalog_id")
>>
>> Again, spark tells me these self joins return exactly the same type,
>> but when I do a .show on them, only the one created 

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
I figured out the trigger. It turns out it wasn't because I loaded it from
the database; it was because the first thing I do after loading is to
lower-case all the strings. After a Dataset has been mapped, the resulting
Dataset can't be self-joined. Here's a test case that illustrates the issue:

case class Test(id: Int)
val test = Seq(
  Test(1),
  Test(2),
  Test(3)
).toDS
test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show //
<-- works fine
val testMapped = test.map(_.id + 1) // add 1 to each
testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" ===
$"t2.id").show
// <-- error


On Fri, May 27, 2016 at 10:44 AM Tim Gautier  wrote:

> I stand corrected. I just created a test table with a single int field to
> test with and the Dataset loaded from that works with no issues. I'll see
> if I can track down exactly what the difference might be.
>
> On Fri, May 27, 2016 at 10:29 AM Tim Gautier 
> wrote:
>
>> I'm using 1.6.1.
>>
>> I'm not sure what good fake data would do since it doesn't seem to have
>> anything to do with the data itself. It has to do with how the Dataset was
>> created. Both datasets have exactly the same data in them, but the one
>> created from a sql query fails where the one created from a Seq works. The
>> case class is just a few Option[Int] and Option[String] fields, nothing
>> special.
>>
>> Obviously there's some sort of difference between the two datasets, but
>> Spark tells me they're exactly the same type with exactly the same data, so
>> I couldn't create a test case without actually accessing a sql database.
>>
>> On Fri, May 27, 2016 at 10:15 AM Ted Yu  wrote:
>>
>>> Which release of Spark are you using ?
>>>
>>> Is it possible to come up with fake data that shows what you described ?
>>>
>>> Thanks
>>>
>>> On Fri, May 27, 2016 at 8:24 AM, Tim Gautier 
>>> wrote:
>>>
 Unfortunately I can't show exactly the data I'm using, but this is what
 I'm seeing:

 I have a case class 'Product' that represents a table in our database.
 I load that data via sqlContext.read.format("jdbc").options(...).
 load.as[Product] and register it in a temp table 'product'.

 For testing, I created a new Dataset that has only 3 records in it:

 val ts = sqlContext.sql("select * from product where product_catalog_id
 in (1, 2, 3)").as[Product]

 I also created another one using the same case class and data, but from
 a sequence instead.

 val ds: Dataset[Product] = Seq(
   Product(Some(1), ...),
   Product(Some(2), ...),
   Product(Some(3), ...)
 ).toDS

 The spark shell tells me these are exactly the same type at this point,
 but they don't behave the same.

 ts.as("ts1").joinWith(ts.as("ts2"), $"ts1.product_catalog_id" ===
 $"ts2.product_catalog_id")
 ds.as("ds1").joinWith(ds.as("ds2"), $"ds1.product_catalog_id" ===
 $"ds2.product_catalog_id")

 Again, spark tells me these self joins return exactly the same type,
 but when I do a .show on them, only the one created from a Seq works. The
 one created by reading from the database throws this error:

 org.apache.spark.sql.AnalysisException: cannot resolve
 'ts1.product_catalog_id' given input columns: [..., product_catalog_id,
 ...];

 Is this a bug? Is there anyway to make the Dataset loaded from a table
 behave like the one created from a sequence?

 Thanks,
 Tim

>>>
>>>


Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
Oops, screwed up my example. This is what it should be:

case class Test(id: Int)
val test = Seq(
  Test(1),
  Test(2),
  Test(3)
).toDS
test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show
val testMapped = test.map(t => t.copy(id = t.id + 1))
testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id
").show


On Fri, May 27, 2016 at 11:16 AM Tim Gautier  wrote:

> I figured it out the trigger. Turns out it wasn't because I loaded it from
> the database, it was because the first thing I do after loading is to lower
> case all the strings. After a Dataset has been mapped, the resulting
> Dataset can't be self joined. Here's a test case that illustrates the issue:
>
> case class Test(id: Int)
> val test = Seq(
>   Test(1),
>   Test(2),
>   Test(3)
> ).toDS
> test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show //
> <-- works fine
> val testMapped = test.map(_.id + 1) // add 1 to each
> testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === 
> $"t2.id").show
> // <-- error
>
>
> On Fri, May 27, 2016 at 10:44 AM Tim Gautier 
> wrote:
>
>> I stand corrected. I just created a test table with a single int field to
>> test with and the Dataset loaded from that works with no issues. I'll see
>> if I can track down exactly what the difference might be.
>>
>> On Fri, May 27, 2016 at 10:29 AM Tim Gautier 
>> wrote:
>>
>>> I'm using 1.6.1.
>>>
>>> I'm not sure what good fake data would do since it doesn't seem to have
>>> anything to do with the data itself. It has to do with how the Dataset was
>>> created. Both datasets have exactly the same data in them, but the one
>>> created from a sql query fails where the one created from a Seq works. The
>>> case class is just a few Option[Int] and Option[String] fields, nothing
>>> special.
>>>
>>> Obviously there's some sort of difference between the two datasets, but
>>> Spark tells me they're exactly the same type with exactly the same data, so
>>> I couldn't create a test case without actually accessing a sql database.
>>>
>>> On Fri, May 27, 2016 at 10:15 AM Ted Yu  wrote:
>>>
 Which release of Spark are you using ?

 Is it possible to come up with fake data that shows what you described ?

 Thanks

 On Fri, May 27, 2016 at 8:24 AM, Tim Gautier 
 wrote:

> Unfortunately I can't show exactly the data I'm using, but this is
> what I'm seeing:
>
> I have a case class 'Product' that represents a table in our database.
> I load that data via sqlContext.read.format("jdbc").options(...).
> load.as[Product] and register it in a temp table 'product'.
>
> For testing, I created a new Dataset that has only 3 records in it:
>
> val ts = sqlContext.sql("select * from product where
> product_catalog_id in (1, 2, 3)").as[Product]
>
> I also created another one using the same case class and data, but
> from a sequence instead.
>
> val ds: Dataset[Product] = Seq(
>   Product(Some(1), ...),
>   Product(Some(2), ...),
>   Product(Some(3), ...)
> ).toDS
>
> The spark shell tells me these are exactly the same type at this
> point, but they don't behave the same.
>
> ts.as("ts1").joinWith(ts.as("ts2"), $"ts1.product_catalog_id" ===
> $"ts2.product_catalog_id")
> ds.as("ds1").joinWith(ds.as("ds2"), $"ds1.product_catalog_id" ===
> $"ds2.product_catalog_id")
>
> Again, spark tells me these self joins return exactly the same type,
> but when I do a .show on them, only the one created from a Seq works. The
> one created by reading from the database throws this error:
>
> org.apache.spark.sql.AnalysisException: cannot resolve
> 'ts1.product_catalog_id' given input columns: [..., product_catalog_id,
> ...];
>
> Is this a bug? Is there anyway to make the Dataset loaded from a table
> behave like the one created from a sequence?
>
> Thanks,
> Tim
>




RE: Not able to write output to local filsystem from Standalone mode.

2016-05-27 Thread Yong Zhang
I am not familiar with that particular piece of code. But Spark's concurrency
comes from multi-threading: one executor will use multiple threads to process
tasks, and these tasks share the JVM memory of the executor. So it isn't
surprising that Spark needs some blocking/synchronization for memory in some
places.
Yong

> Date: Fri, 27 May 2016 20:21:46 +0200
> Subject: Re: Not able to write output to local filsystem from Standalone mode.
> From: ja...@japila.pl
> To: java8...@hotmail.com
> CC: math...@closetwork.org; stutiawas...@hcl.com; user@spark.apache.org
> 
> Hi Yong,
> 
> It makes sense...almost. :) I'm not sure how relevant it is, but just
> today was reviewing BlockInfoManager code with the locks for reading
> and writing, and my understanding of the code shows that Spark if fine
> when there are multiple attempts for writes of new memory blocks
> (pages) with a mere synchronized code block. See
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala#L324-L325
> 
> With that, it's not that simple to say "that just makes sense".
> 
> p.s. The more I know the less things "just make sense to me".
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
> 
> 
> On Fri, May 27, 2016 at 3:42 AM, Yong Zhang  wrote:
> > That just makes sense, doesn't it?
> >
> > The only place will be driver. If not, the executor will be having
> > contention by whom should create the directory in this case.
> >
> > Only the coordinator (driver in this case) is the best place for doing it.
> >
> > Yong
> >
> > 
> > From: math...@closetwork.org
> > Date: Wed, 25 May 2016 18:23:02 +
> > Subject: Re: Not able to write output to local filsystem from Standalone
> > mode.
> > To: ja...@japila.pl
> > CC: stutiawas...@hcl.com; user@spark.apache.org
> >
> >
> > Experience. I don't use Mesos or Yarn or Hadoop, so I don't know.
> >
> >
> > On Wed, May 25, 2016 at 2:51 AM Jacek Laskowski  wrote:
> >
> > Hi Mathieu,
> >
> > Thanks a lot for the answer! I did *not* know it's the driver to
> > create the directory.
> >
> > You said "standalone mode", is this the case for the other modes -
> > yarn and mesos?
> >
> > p.s. Did you find it in the code or...just experienced before? #curious
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://medium.com/@jaceklaskowski/
> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
> > Follow me at https://twitter.com/jaceklaskowski
> >
> >
> > On Tue, May 24, 2016 at 4:04 PM, Mathieu Longtin 
> > wrote:
> >> In standalone mode, executor assume they have access to a shared file
> >> system. The driver creates the directory and the executor write files, so
> >> the executors end up not writing anything since there is no local
> >> directory.
> >>
> >> On Tue, May 24, 2016 at 8:01 AM Stuti Awasthi 
> >> wrote:
> >>>
> >>> hi Jacek,
> >>>
> >>> Parent directory already present, its my home directory. Im using Linux
> >>> (Redhat) machine 64 bit.
> >>> Also I noticed that "test1" folder is created in my master with
> >>> subdirectory as "_temporary" which is empty. but on slaves, no such
> >>> directory is created under /home/stuti.
> >>>
> >>> Thanks
> >>> Stuti
> >>> 
> >>> From: Jacek Laskowski [ja...@japila.pl]
> >>> Sent: Tuesday, May 24, 2016 5:27 PM
> >>> To: Stuti Awasthi
> >>> Cc: user
> >>> Subject: Re: Not able to write output to local filsystem from Standalone
> >>> mode.
> >>>
> >>> Hi,
> >>>
> >>> What happens when you create the parent directory /home/stuti? I think
> >>> the
> >>> failure is due to missing parent directories. What's the OS?
> >>>
> >>> Jacek
> >>>
> >>> On 24 May 2016 11:27 a.m., "Stuti Awasthi"  wrote:
> >>>
> >>> Hi All,
> >>>
> >>> I have 3 nodes Spark 1.6 Standalone mode cluster with 1 Master and 2
> >>> Slaves. Also Im not having Hadoop as filesystem . Now, Im able to launch
> >>> shell , read the input file from local filesystem and perform
> >>> transformation
> >>> successfully. When I try to write my output in local filesystem path then
> >>> I
> >>> receive below error .
> >>>
> >>>
> >>>
> >>> I tried to search on web and found similar Jira :
> >>> https://issues.apache.org/jira/browse/SPARK-2984 . Even though it shows
> >>> resolved for Spark 1.3+ but already people have posted the same issue
> >>> still
> >>> persists in latest versions.
> >>>
> >>>
> >>>
> >>> ERROR
> >>>
> >>> scala> data.saveAsTextFile("/home/stuti/test1")
> >>>
> >>> 16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2,
> >>> server1): java.io.IOException: The temporary job-output directory
> >>> file:/home/stuti/test1/_temporary doesn't exist!
> >>>
> >>> at
> >>>
> >>> 

Re: save RDD of Avro GenericRecord as parquet throws UnsupportedOperationException

2016-05-27 Thread Govindasamy, Nagarajan
Hi Ted,


The link is useful. Still, I could not figure out the way to convert the
RDD[GenericRecord] into a DataFrame.


I tried to create the Spark SQL schema from the Avro schema:

val json = """{"type":"record","name":"Profile","fields":

  [{"name":"userid","type":"string"},
  {"name":"created_time","type":"long"},
  {"name":"updated_time","type":"long"}]}"""

val schema: StructType = DataType.fromJson(json).asInstanceOf[StructType]
val profileDataFrame = sqlContext.createDataFrame(mergedProfiles, schema)



Getting the following compilation error:


[ERROR] ProfileService.scala:119: error: overloaded method value 
createDataFrame with alternatives:
[INFO]   (data: java.util.List[_],beanClass: 
Class[_])org.apache.spark.sql.DataFrame 
[INFO]   (rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: 
Class[_])org.apache.spark.sql.DataFrame 
[INFO]   (rdd: org.apache.spark.rdd.RDD[_],beanClass: 
Class[_])org.apache.spark.sql.DataFrame 
[INFO]   (rows: java.util.List[org.apache.spark.sql.Row],schema: 
org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame 
[INFO]   (rowRDD: 
org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: 
org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame 
[INFO]   (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: 
org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
[INFO]  cannot be applied to 
(org.apache.spark.streaming.dstream.MapWithStateDStream[(String, String, 
String),org.apache.avro.generic.GenericRecord,org.apache.avro.generic.GenericRecord,((String,
 String, String), org.apache.avro.generic.GenericRecord)], 
org.apache.spark.sql.types.StructType)
[INFO] val profileDataFrame = sqlContext.createDataFrame(mergedProfiles, 
schema)
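
As the error shows, the first argument is a DStream rather than an RDD of Rows,
so one possible sketch is to convert each GenericRecord to a Row inside
foreachRDD and build the DataFrame per batch (field accessors follow the
Profile schema above; the output path is a placeholder):

import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.Row

mergedProfiles.foreachRDD { rdd =>
  // rdd elements are ((String, String, String), GenericRecord) pairs
  val rows = rdd.map { case (_, record: GenericRecord) =>
    Row(record.get("userid").toString,
        record.get("created_time").asInstanceOf[Long],
        record.get("updated_time").asInstanceOf[Long])
  }
  sqlContext.createDataFrame(rows, schema).write.mode("append").parquet("s3://bucket/data.parquet")
}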


Thanks,


Raj



From: Ted Yu 
Sent: Thursday, May 26, 2016 12:01:43 PM
To: Govindasamy, Nagarajan
Cc: user@spark.apache.org
Subject: Re: save RDD of Avro GenericRecord as parquet throws 
UnsupportedOperationException

Have you seen this thread ?

http://search-hadoop.com/m/q3RTtWmyYB5fweR=Re+Best+way+to+store+Avro+Objects+as+Parquet+using+SPARK

On Thu, May 26, 2016 at 6:55 AM, Govindasamy, Nagarajan 
> wrote:

Hi,

I am trying to save RDD of Avro GenericRecord as parquet. I am using Spark 
1.6.1.

DStreamOfAvroGenericRecord.foreachRDD(rdd => 
rdd.toDF().write.parquet("s3://bucket/data.parquet"))

Getting the following exception. Is there a way to save Avro GenericRecord as 
Parquet or ORC file?

java.lang.UnsupportedOperationException: Schema for type 
org.apache.avro.generic.GenericRecord is not supported
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:715)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:690)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:689)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:689)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:642)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:414)
at 
org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:155)

Thanks,

Raj



Re: Not able to write output to local filsystem from Standalone mode.

2016-05-27 Thread Jacek Laskowski
Hi Yong,

It makes sense... almost. :) I'm not sure how relevant it is, but just
today I was reviewing the BlockInfoManager code with the locks for reading
and writing, and my understanding of the code is that Spark is fine
when there are multiple attempts to write new memory blocks
(pages) with a mere synchronized code block. See
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala#L324-L325

With that, it's not that simple to say "that just makes sense".

p.s. The more I know the less things "just make sense to me".

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, May 27, 2016 at 3:42 AM, Yong Zhang  wrote:
> That just makes sense, doesn't it?
>
> The only place will be driver. If not, the executor will be having
> contention by whom should create the directory in this case.
>
> Only the coordinator (driver in this case) is the best place for doing it.
>
> Yong
>
> 
> From: math...@closetwork.org
> Date: Wed, 25 May 2016 18:23:02 +
> Subject: Re: Not able to write output to local filsystem from Standalone
> mode.
> To: ja...@japila.pl
> CC: stutiawas...@hcl.com; user@spark.apache.org
>
>
> Experience. I don't use Mesos or Yarn or Hadoop, so I don't know.
>
>
> On Wed, May 25, 2016 at 2:51 AM Jacek Laskowski  wrote:
>
> Hi Mathieu,
>
> Thanks a lot for the answer! I did *not* know it's the driver to
> create the directory.
>
> You said "standalone mode", is this the case for the other modes -
> yarn and mesos?
>
> p.s. Did you find it in the code or...just experienced before? #curious
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, May 24, 2016 at 4:04 PM, Mathieu Longtin 
> wrote:
>> In standalone mode, executor assume they have access to a shared file
>> system. The driver creates the directory and the executor write files, so
>> the executors end up not writing anything since there is no local
>> directory.
>>
>> On Tue, May 24, 2016 at 8:01 AM Stuti Awasthi 
>> wrote:
>>>
>>> hi Jacek,
>>>
>>> Parent directory already present, its my home directory. Im using Linux
>>> (Redhat) machine 64 bit.
>>> Also I noticed that "test1" folder is created in my master with
>>> subdirectory as "_temporary" which is empty. but on slaves, no such
>>> directory is created under /home/stuti.
>>>
>>> Thanks
>>> Stuti
>>> 
>>> From: Jacek Laskowski [ja...@japila.pl]
>>> Sent: Tuesday, May 24, 2016 5:27 PM
>>> To: Stuti Awasthi
>>> Cc: user
>>> Subject: Re: Not able to write output to local filsystem from Standalone
>>> mode.
>>>
>>> Hi,
>>>
>>> What happens when you create the parent directory /home/stuti? I think
>>> the
>>> failure is due to missing parent directories. What's the OS?
>>>
>>> Jacek
>>>
>>> On 24 May 2016 11:27 a.m., "Stuti Awasthi"  wrote:
>>>
>>> Hi All,
>>>
>>> I have a 3-node Spark 1.6 Standalone mode cluster with 1 master and 2
>>> slaves, and I am not using Hadoop as the filesystem. I am able to launch the
>>> shell, read the input file from the local filesystem and perform the
>>> transformations successfully. When I try to write my output to a local
>>> filesystem path, I receive the error below.
>>>
>>>
>>>
>>> I searched the web and found a similar JIRA:
>>> https://issues.apache.org/jira/browse/SPARK-2984 . Even though it shows as
>>> resolved for Spark 1.3+, people have posted that the same issue still
>>> persists in the latest versions.
>>>
>>>
>>>
>>> ERROR
>>>
>>> scala> data.saveAsTextFile("/home/stuti/test1")
>>>
>>> 16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2,
>>> server1): java.io.IOException: The temporary job-output directory
>>> file:/home/stuti/test1/_temporary doesn't exist!
>>>
>>> at
>>>
>>> org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
>>>
>>> at
>>>
>>> org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
>>>
>>> at
>>>
>>> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
>>>
>>> at
>>> org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
>>>
>>> at
>>>
>>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
>>>
>>> at
>>>
>>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
>>>
>>> at
>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>>
>>> at 
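To make Mathieu's point above concrete: in standalone mode without a shared filesystem, the usual workarounds are to save to storage every node can reach, or to collect small results to the driver and write there. A minimal sketch from a spark-shell session, with placeholder paths:

import java.io.PrintWriter

val data = sc.textFile("file:///home/stuti/input.txt").map(_.toUpperCase)

// Option 1: save to a filesystem the driver and every executor can reach
// (HDFS, an NFS mount, S3, ...), so the _temporary directory is visible to all of them.
data.saveAsTextFile("hdfs:///user/stuti/test1")

// Option 2: for small results only, bring everything back to the driver
// and write it locally there.
val out = new PrintWriter("/home/stuti/test1.txt")
data.collect().foreach(out.println)
out.close()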

Re: Can not set spark dynamic resource allocation

2016-05-27 Thread Cui, Weifeng
Sorry to reply this late.


<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/local/output/logs/nm-log-dir</value>
</property>

We do not use file://  in the settings, so that should not be the problem. Any 
other guesses?

Weifeng



On 5/20/16, 2:40 PM, "David Newberger"  wrote:

>Hi All,
>
>The error you are seeing looks really similar to Spark-13514 to me. I could be 
>wrong though
>
>https://issues.apache.org/jira/browse/SPARK-13514
>
>Can you check yarn.nodemanager.local-dirs  in your YARN configuration for 
>"file://"
>
>
>Cheers!
>David Newberger
>
>-Original Message-
>From: Cui, Weifeng [mailto:weife...@a9.com] 
>Sent: Friday, May 20, 2016 4:26 PM
>To: Marcelo Vanzin
>Cc: Ted Yu; Rodrick Brown; user; Zhao, Jun; Aulakh, Sahib; Song, Yiwei
>Subject: Re: Can not set spark dynamic resource allocation
>
>Sorry, here is the node-manager log. application_1463692924309_0002 is my 
>test. Hope this will help.
>http://pastebin.com/0BPEcgcW
>
>
>
>On 5/20/16, 2:09 PM, "Marcelo Vanzin"  wrote:
>
>>Hi Weifeng,
>>
>>That's the Spark event log, not the YARN application log. You get the 
>>latter using the "yarn logs" command.
>>
>>On Fri, May 20, 2016 at 1:14 PM, Cui, Weifeng  wrote:
>>> Here is the application log for this spark job.
>>>
>>> http://pastebin.com/2UJS9L4e
>>>
>>>
>>>
>>> Thanks,
>>> Weifeng
>>>
>>>
>>>
>>>
>>>
>>> From: "Aulakh, Sahib" 
>>> Date: Friday, May 20, 2016 at 12:43 PM
>>> To: Ted Yu 
>>> Cc: Rodrick Brown , Cui Weifeng 
>>> , user , "Zhao, Jun"
>>> 
>>> Subject: Re: Can not set spark dynamic resource allocation
>>>
>>>
>>>
>>> Yes it is yarn. We have configured spark shuffle service w yarn node 
>>> manager but something must be off.
>>>
>>>
>>>
>>> We will send u app log on paste bin.
>>>
>>> Sent from my iPhone
>>>
>>>
>>> On May 20, 2016, at 12:35 PM, Ted Yu  wrote:
>>>
>>> Since yarn-site.xml was cited, I assume the cluster runs YARN.
>>>
>>>
>>>
>>> On Fri, May 20, 2016 at 12:30 PM, Rodrick Brown 
>>>  wrote:
>>>
>>> Is this Yarn or Mesos? For the later you need to start an external 
>>> shuffle service.
>>>
>>> Get Outlook for iOS
>>>
>>>
>>>
>>>
>>>
>>> On Fri, May 20, 2016 at 11:48 AM -0700, "Cui, Weifeng" 
>>> 
>>> wrote:
>>>
>>> Hi guys,
>>>
>>>
>>>
>>> Our team has a hadoop 2.6.0 cluster with Spark 1.6.1. We want to set 
>>> dynamic resource allocation for spark and we followed the following 
>>> link. After the changes, all spark jobs failed.
>>>
>>> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-reso
>>> urce-allocation
>>>
>>> This test was on a test cluster which has 1 master machine (running 
>>> namenode, resourcemanager and hive server), 1 worker machine (running 
>>> datanode and nodemanager) and 1 machine as client( running spark shell).
>>>
>>>
>>>
>>> What I updated in config :
>>>
>>>
>>>
>>> 1. Update in spark-defaults.conf
>>>
>>> spark.dynamicAllocation.enabled true
>>>
>>> spark.shuffle.service.enabled    true
>>>
>>>
>>>
>>> 2. Update yarn-site.xml
>>>
>>> <property>
>>>   <name>yarn.nodemanager.aux-services</name>
>>>   <value>mapreduce_shuffle,spark_shuffle</value>
>>> </property>
>>>
>>> <property>
>>>   <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
>>>   <value>org.apache.spark.network.yarn.YarnShuffleService</value>
>>> </property>
>>>
>>> <property>
>>>   <name>spark.shuffle.service.enabled</name>
>>>   <value>true</value>
>>> </property>
>>>
>>> 3. Copy  spark-1.6.1-yarn-shuffle.jar to yarn.application.classpath
>>> ($HADOOP_HOME/share/hadoop/yarn/*) in python code
>>>
>>> 4. Restart namenode, datanode, resourcemanager, nodemanager... restart
>>> everything
>>>
>>> 5. The config is updated on all machines (resourcemanager and nodemanager).
>>> We update the config in one place and copy it to all machines.
>>>
>>>
>>>
>>> What I tested:
>>>
>>>
>>>
>>> 1. I started a Scala spark shell and checked its environment variables;
>>> spark.dynamicAllocation.enabled is true.
>>>
>>> 2. I used the following code:
>>>
>>> scala > val line =
>>> sc.textFile("/spark-events/application_1463681113470_0006")
>>>
>>> line: org.apache.spark.rdd.RDD[String] =
>>> /spark-events/application_1463681113470_0006 MapPartitionsRDD[1] at 
>>> textFile at :27
>>>
>>> scala > line.count # This command just stuck here
>>>
>>>
>>>
>>> 3. In the beginning, there is only 1 executor(this is for driver) and 
>>> after line.count, I could see 3 executors, then dropped to 1.
>>>
>>> 4. Several jobs were launched and all of them failed.   Tasks (for all
>>> stages): Succeeded/Total : 0/2 (4 failed)
>>>
>>>
>>>
>>> Error messages:
>>>
>>>
>>>
>>> I found the following messages in spark web UI. I found this in 
>>> spark.log on nodemanager machine as well.
>>>
>>>
>>>
>>> ExecutorLostFailure 
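For reference, a hedged sketch of the spark-defaults.conf entries that usually go together for dynamic allocation on YARN (the executor counts and idle timeout below are placeholders; the aux-service entries quoted above are still required in yarn-site.xml on every NodeManager):

spark.dynamicAllocation.enabled              true
spark.dynamicAllocation.minExecutors         1
spark.dynamicAllocation.maxExecutors         20
spark.dynamicAllocation.executorIdleTimeout  60s
spark.shuffle.service.enabled                true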

Re: Undocumented left join constraint?

2016-05-27 Thread Ted Yu
Which release did you use ?

I tried your example in master branch:

scala> val test2 = Seq(Test(2), Test(3), Test(4)).toDS
test2: org.apache.spark.sql.Dataset[Test] = [id: int]

scala>  test1.as("t1").joinWith(test2.as("t2"), $"t1.id" === $"t2.id",
"left_outer").show
+---+--+
| _1|_2|
+---+--+
|[1]|[null]|
|[2]|   [2]|
|[3]|   [3]|
+---+--+

On Fri, May 27, 2016 at 1:01 PM, Tim Gautier  wrote:

> Is it truly impossible to left join a Dataset[T] on the right if T has any
> non-option fields? It seems Spark tries to create Ts with null values in
> all fields when left joining, which results in null pointer exceptions. In
> fact, I haven't found any other way to get around this issue without making
> all fields in T options. Is there any other way?
>
> Example:
>
> case class Test(id: Int)
> val test1 = Seq(Test(1), Test(2), Test(3)).toDS
> val test2 = Seq(Test(2), Test(3), Test(4)).toDS
> test1.as("t1").joinWith(test2.as("t2"), $"t1.id" === $"t2.id",
> "left_outer").show
>
>


Re: Undocumented left join constraint?

2016-05-27 Thread Tim Gautier
When I run it in 1.6.1 I get this:

java.lang.RuntimeException: Error while decoding:
java.lang.RuntimeException: Null value appeared in non-nullable field:
- field (class: "scala.Int", name: "id")
- root class: "$iwC.$iwC.Test"
If the schema is inferred from a Scala tuple/case class, or a Java bean,
please try to use scala.Option[_] or other nullable types (e.g.
java.lang.Integer instead of int/scala.Int).
newinstance(class scala.Tuple2,newinstance(class
$iwC$$iwC$Test,assertnotnull(input[0,
StructType(StructField(id,IntegerType,false))].id,- field (class:
"scala.Int", name: "id"),- root class:
"$iwC.$iwC.Test"),false,ObjectType(class
$iwC$$iwC$Test),Some($iwC$$iwC@6e40bddd)),newinstance(class
$iwC$$iwC$Test,assertnotnull(input[1,
StructType(StructField(id,IntegerType,true))].id,- field (class:
"scala.Int", name: "id"),- root class:
"$iwC.$iwC.Test"),false,ObjectType(class
$iwC$$iwC$Test),Some($iwC$$iwC@6e40bddd)),false,ObjectType(class
scala.Tuple2),None)
:- newinstance(class $iwC$$iwC$Test,assertnotnull(input[0,
StructType(StructField(id,IntegerType,false))].id,- field (class:
"scala.Int", name: "id"),- root class:
"$iwC.$iwC.Test"),false,ObjectType(class
$iwC$$iwC$Test),Some($iwC$$iwC@6e40bddd))
:  +- assertnotnull(input[0,
StructType(StructField(id,IntegerType,false))].id,- field (class:
"scala.Int", name: "id"),- root class: "$iwC.$iwC.Test")
: +- input[0, StructType(StructField(id,IntegerType,false))].id
:+- input[0, StructType(StructField(id,IntegerType,false))]
+- newinstance(class $iwC$$iwC$Test,assertnotnull(input[1,
StructType(StructField(id,IntegerType,true))].id,- field (class:
"scala.Int", name: "id"),- root class:
"$iwC.$iwC.Test"),false,ObjectType(class
$iwC$$iwC$Test),Some($iwC$$iwC@6e40bddd))
   +- assertnotnull(input[1,
StructType(StructField(id,IntegerType,true))].id,- field (class:
"scala.Int", name: "id"),- root class: "$iwC.$iwC.Test")
  +- input[1, StructType(StructField(id,IntegerType,true))].id
 +- input[1, StructType(StructField(id,IntegerType,true))]


On Fri, May 27, 2016 at 2:48 PM Tim Gautier  wrote:

> Interesting, I did that on 1.6.1, Scala 2.10
>
> On Fri, May 27, 2016 at 2:41 PM Ted Yu  wrote:
>
>> Which release did you use ?
>>
>> I tried your example in master branch:
>>
>> scala> val test2 = Seq(Test(2), Test(3), Test(4)).toDS
>> test2: org.apache.spark.sql.Dataset[Test] = [id: int]
>>
>> scala>  test1.as("t1").joinWith(test2.as("t2"), $"t1.id" === $"t2.id",
>> "left_outer").show
>> +---+--+
>> | _1|_2|
>> +---+--+
>> |[1]|[null]|
>> |[2]|   [2]|
>> |[3]|   [3]|
>> +---+--+
>>
>> On Fri, May 27, 2016 at 1:01 PM, Tim Gautier 
>> wrote:
>>
>>> Is it truly impossible to left join a Dataset[T] on the right if T has
>>> any non-option fields? It seems Spark tries to create Ts with null values
>>> in all fields when left joining, which results in null pointer exceptions.
>>> In fact, I haven't found any other way to get around this issue without
>>> making all fields in T options. Is there any other way?
>>>
>>> Example:
>>>
>>> case class Test(id: Int)
>>> val test1 = Seq(Test(1), Test(2), Test(3)).toDS
>>> val test2 = Seq(Test(2), Test(3), Test(4)).toDS
>>> test1.as("t1").joinWith(test2.as("t2"), $"t1.id" === $"t2.id",
>>> "left_outer").show
>>>
>>>
>>


Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
I'm working around it like this:

val testMapped2 = test1.rdd.map(t => t.copy(id = t.id + 1)).toDF.as[Test]
testMapped2.as("t1").joinWith(testMapped2.as("t2"), $"t1.id" === $"t2.id").show

Switching from RDD, then mapping, then going back to DS seemed to avoid the
issue.

On Fri, May 27, 2016 at 3:10 PM Koert Kuipers  wrote:

> i am glad to see this, i think we ran into this as well (in
> 2.0.0-SNAPSHOT) but i couldn't reproduce it nicely.
>
> my observation was that joins of 2 datasets that were derived from the
> same datasource gave this kind of trouble. i changed my datasource from val
> to def (so it got created twice) as a workaround. the error did not occur
> with datasets created in unit test with sc.parallelize.
>
> On Fri, May 27, 2016 at 1:26 PM, Ted Yu  wrote:
>
>> I tried master branch :
>>
>> scala> val testMapped = test.map(t => t.copy(id = t.id + 1))
>> testMapped: org.apache.spark.sql.Dataset[Test] = [id: int]
>>
>> scala>  testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"
>> t2.id").show
>> org.apache.spark.sql.AnalysisException: cannot resolve '`t1.id`' given
>> input columns: [id];
>>   at
>> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>>   at
>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:62)
>>   at
>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:59)
>>   at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
>>   at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
>>   at
>> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)
>>
>>
>> Suggest logging a JIRA if there is none logged.
>>
>> On Fri, May 27, 2016 at 10:19 AM, Tim Gautier 
>> wrote:
>>
>>> Oops, screwed up my example. This is what it should be:
>>>
>>> case class Test(id: Int)
>>> val test = Seq(
>>>   Test(1),
>>>   Test(2),
>>>   Test(3)
>>> ).toDS
>>> test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show
>>> val testMapped = test.map(t => t.copy(id = t.id + 1))
>>> testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"
>>> t2.id").show
>>>
>>>
>>> On Fri, May 27, 2016 at 11:16 AM Tim Gautier 
>>> wrote:
>>>
 I figured out the trigger. Turns out it wasn't because I loaded it
 from the database, it was because the first thing I do after loading is to
 lower case all the strings. After a Dataset has been mapped, the resulting
 Dataset can't be self joined. Here's a test case that illustrates the 
 issue:

 case class Test(id: Int)
 val test = Seq(
   Test(1),
   Test(2),
   Test(3)
 ).toDS
 test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show
 // <-- works fine
 val testMapped = test.map(_.id + 1) // add 1 to each
 testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"
 t2.id").show // <-- error


 On Fri, May 27, 2016 at 10:44 AM Tim Gautier 
 wrote:

> I stand corrected. I just created a test table with a single int field
> to test with and the Dataset loaded from that works with no issues. I'll
> see if I can track down exactly what the difference might be.
>
> On Fri, May 27, 2016 at 10:29 AM Tim Gautier 
> wrote:
>
>> I'm using 1.6.1.
>>
>> I'm not sure what good fake data would do since it doesn't seem to
>> have anything to do with the data itself. It has to do with how the 
>> Dataset
>> was created. Both datasets have exactly the same data in them, but the 
>> one
>> created from a sql query fails where the one created from a Seq works. 
>> The
>> case class is just a few Option[Int] and Option[String] fields, nothing
>> special.
>>
>> Obviously there's some sort of difference between the two datasets,
>> but Spark tells me they're exactly the same type with exactly the same
>> data, so I couldn't create a test case without actually accessing a sql
>> database.
>>
>> On Fri, May 27, 2016 at 10:15 AM Ted Yu  wrote:
>>
>>> Which release of Spark are you using ?
>>>
>>> Is it possible to come up with fake data that shows what you
>>> described ?
>>>
>>> Thanks
>>>
>>> On Fri, May 27, 2016 at 8:24 AM, Tim Gautier 
>>> wrote:
>>>
 Unfortunately I can't show exactly the data I'm using, but this is
 what I'm seeing:

 I have a case class 'Product' that represents a table in our
 database. I load that data via 

Undocumented left join constraint?

2016-05-27 Thread Tim Gautier
Is it truly impossible to left join a Dataset[T] on the right if T has any
non-option fields? It seems Spark tries to create Ts with null values in
all fields when left joining, which results in null pointer exceptions. In
fact, I haven't found any other way to get around this issue without making
all fields in T options. Is there any other way?

Example:

case class Test(id: Int)
val test1 = Seq(Test(1), Test(2), Test(3)).toDS
val test2 = Seq(Test(2), Test(3), Test(4)).toDS
test1.as("t1").joinWith(test2.as("t2"), $"t1.id" === $"t2.id",
"left_outer").show


Spark Streaming - Is window() caching DStreams?

2016-05-27 Thread Marco1982
Dear all,

Can someone please explain to me how Spark Streaming executes the window()
operation? From the Spark 1.6.1 documentation, it seems that windowed
batches are automatically cached in memory, but looking at the web UI it
seems that operations already executed in previous batches are executed
again. For your convenience, I attach a screenshot of my running application
below.

By looking at the web UI, it seems that flatMapValues() RDDs are cached
(green spot - this is the last operation executed before I call window() on
the DStream), but, at the same time, it also seems that all the
transformations that led to flatMapValues() in previous batches are executed
again. If this is the case, the window() operation may induce huge
performance penalties, especially if the window duration is 1 or 2 hours (as
I expect for my application). Do you think that checkpointing the DStream in
that case would be helpful? Note that the expected slide interval is about 5
minutes.

Hope someone can clarify this point.

Thanks,
Marco
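A hedged sketch of the two knobs discussed here, assuming an existing SparkContext sc and a placeholder socket source: persist the pre-window DStream once so window operations reuse each batch instead of recomputing it, and checkpoint it so the lineage does not keep growing across a long window.

import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(30))
ssc.checkpoint("hdfs:///checkpoints/windowed-app")   // checkpoint directory for the application

val events = ssc.socketTextStream("localhost", 9999)
val pairs  = events.map(line => (line.split(",")(0), 1L))

pairs.cache()                   // reuse each batch's RDD across overlapping windows
pairs.checkpoint(Minutes(10))   // periodically truncate the lineage

// 1-hour window, sliding every 5 minutes
val counts = pairs.reduceByKeyAndWindow(_ + _, Minutes(60), Minutes(5))
counts.print()

ssc.start()
ssc.awaitTermination()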
 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Is-window-caching-DStreams-tp27041.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Pros and Cons

2016-05-27 Thread Teng Qiu
yes, only for engine, but maybe newer version has more optimization
from tungsten project? at least since spark 1.6?


> -- Forwarded message --
> From: Mich Talebzadeh 
> Date: 27 May 2016 at 17:09
> Subject: Re: Pros and Cons
> To: Teng Qiu 
> Cc: Ted Yu , Koert Kuipers , Jörn
> Franke , user , Aakash Basu
> , Reynold Xin 
>
>
> not worth spending time really.
>
> The only version that works is Spark 1.3.1 with Hive 2
> To be perfectly honest deploying Spark as Hive engine only requires certain
> Spark capabilities that I think 1.3.1 is OK with it. Remember we are talking
> about engine not HQL etc. That is all provided by Hive itself.
>
> To me a gain in performance at least 2 compared to MR is perfectly
> acceptable.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
> On 27 May 2016 at 16:58, Teng Qiu  wrote:
>>
>> tried spark 2.0.0 preview, but no assembly jar there... then just gave
>> up... :p
>>




Range partition for parquet file?

2016-05-27 Thread Rex Xiong
Hi,

I have a spark job output DataFrame which contains a column named Id, which
is a GUID string.
We will use Id to filter data in another spark application, so it should be
a partition key.

I found these two methods on the Internet:

1.
The DataFrame.write.save("Id") method would help, but the possible value space
for a GUID is too big; I would prefer to do a range partition to get 100
evenly sized partitions.

2.
Another way is DataFrame.repartition("Id"), but the result seems to only
stay in memory; once it's saved and then loaded from another Spark
application, do we need to repartition it again?

After all, what is the relationship between Parquet partitions and
DataFrame.repartition?
E.g.
The parquet data is stored physically under /year=X/month=Y; I load this
data into a DataFrame, then call DataFrame.repartition("Id"), and run this query:
df.filter("year=2016 and month=5 and Id=''")
Will Parquet folder pruning still work? Or, since it has already been
repartitioned by Id, does it need to scan all year/month combinations?

Thanks
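A hedged sketch of one way to get a bounded number of Id partitions on disk: derive a bucket column from the GUID and include it in partitionBy. Here df stands for the job's output DataFrame with Id, year and month columns, and the output path is a placeholder:

import org.apache.spark.sql.functions.{col, crc32, lit, pmod}

// Stable bucket in [0, 100) computed from the GUID string.
val bucketed = df.withColumn("Id_bucket", pmod(crc32(col("Id")), lit(100)))

bucketed.write
  .partitionBy("year", "month", "Id_bucket")
  .parquet("/data/output_by_id_bucket")

// A reader that recomputes the bucket for the GUID it wants can prune on the
// year/month/Id_bucket directories before filtering on the exact Id value:
//   sqlContext.read.parquet("/data/output_by_id_bucket")
//     .filter(col("year") === 2016 && col("month") === 5 &&
//             col("Id_bucket") === someBucket && col("Id") === someGuid)

As far as I know, repartition("Id") only changes the in-memory shuffle layout and is not persisted with the Parquet files, so folder pruning comes from the partitionBy directory structure plus the filters in the query, not from repartition.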


Re: Undocumented left join constraint?

2016-05-27 Thread Tim Gautier
Interesting, I did that on 1.6.1, Scala 2.10

On Fri, May 27, 2016 at 2:41 PM Ted Yu  wrote:

> Which release did you use ?
>
> I tried your example in master branch:
>
> scala> val test2 = Seq(Test(2), Test(3), Test(4)).toDS
> test2: org.apache.spark.sql.Dataset[Test] = [id: int]
>
> scala>  test1.as("t1").joinWith(test2.as("t2"), $"t1.id" === $"t2.id",
> "left_outer").show
> +---+--+
> | _1|_2|
> +---+--+
> |[1]|[null]|
> |[2]|   [2]|
> |[3]|   [3]|
> +---+--+
>
> On Fri, May 27, 2016 at 1:01 PM, Tim Gautier 
> wrote:
>
>> Is it truly impossible to left join a Dataset[T] on the right if T has
>> any non-option fields? It seems Spark tries to create Ts with null values
>> in all fields when left joining, which results in null pointer exceptions.
>> In fact, I haven't found any other way to get around this issue without
>> making all fields in T options. Is there any other way?
>>
>> Example:
>>
>> case class Test(id: Int)
>> val test1 = Seq(Test(1), Test(2), Test(3)).toDS
>> val test2 = Seq(Test(2), Test(3), Test(4)).toDS
>> test1.as("t1").joinWith(test2.as("t2"), $"t1.id" === $"t2.id",
>> "left_outer").show
>>
>>
>


Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Koert Kuipers
i am glad to see this, i think we ran into this as well (in 2.0.0-SNAPSHOT)
but i couldn't reproduce it nicely.

my observation was that joins of 2 datasets that were derived from the same
datasource gave this kind of trouble. i changed my datasource from val to
def (so it got created twice) as a workaround. the error did not occur with
datasets created in unit test with sc.parallelize.
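A minimal sketch of that val-to-def workaround, with a hypothetical jdbc-backed loader standing in for the real datasource (connection options are placeholders):

case class Product(id: Option[Int], name: Option[String])

// def instead of val: every call re-reads the source and builds a fresh plan,
// so the two sides of the self-join do not end up sharing one set of resolved attributes.
def products = sqlContext.read.format("jdbc")
  .option("url", "jdbc:postgresql://db-host/shop")
  .option("dbtable", "product")
  .load()
  .as[Product]

products.as("p1").joinWith(products.as("p2"), $"p1.id" === $"p2.id").show()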

On Fri, May 27, 2016 at 1:26 PM, Ted Yu  wrote:

> I tried master branch :
>
> scala> val testMapped = test.map(t => t.copy(id = t.id + 1))
> testMapped: org.apache.spark.sql.Dataset[Test] = [id: int]
>
> scala>  testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"
> t2.id").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`t1.id`' given
> input columns: [id];
>   at
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:62)
>   at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:59)
>   at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
>   at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
>   at
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)
>
>
> Suggest logging a JIRA if there is none logged.
>
> On Fri, May 27, 2016 at 10:19 AM, Tim Gautier 
> wrote:
>
>> Oops, screwed up my example. This is what it should be:
>>
>> case class Test(id: Int)
>> val test = Seq(
>>   Test(1),
>>   Test(2),
>>   Test(3)
>> ).toDS
>> test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show
>> val testMapped = test.map(t => t.copy(id = t.id + 1))
>> testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"
>> t2.id").show
>>
>>
>> On Fri, May 27, 2016 at 11:16 AM Tim Gautier 
>> wrote:
>>
>>> I figured out the trigger. Turns out it wasn't because I loaded it
>>> from the database, it was because the first thing I do after loading is to
>>> lower case all the strings. After a Dataset has been mapped, the resulting
>>> Dataset can't be self joined. Here's a test case that illustrates the issue:
>>>
>>> case class Test(id: Int)
>>> val test = Seq(
>>>   Test(1),
>>>   Test(2),
>>>   Test(3)
>>> ).toDS
>>> test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show
>>> // <-- works fine
>>> val testMapped = test.map(_.id + 1) // add 1 to each
>>> testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"
>>> t2.id").show // <-- error
>>>
>>>
>>> On Fri, May 27, 2016 at 10:44 AM Tim Gautier 
>>> wrote:
>>>
 I stand corrected. I just created a test table with a single int field
 to test with and the Dataset loaded from that works with no issues. I'll
 see if I can track down exactly what the difference might be.

 On Fri, May 27, 2016 at 10:29 AM Tim Gautier 
 wrote:

> I'm using 1.6.1.
>
> I'm not sure what good fake data would do since it doesn't seem to
> have anything to do with the data itself. It has to do with how the 
> Dataset
> was created. Both datasets have exactly the same data in them, but the one
> created from a sql query fails where the one created from a Seq works. The
> case class is just a few Option[Int] and Option[String] fields, nothing
> special.
>
> Obviously there's some sort of difference between the two datasets,
> but Spark tells me they're exactly the same type with exactly the same
> data, so I couldn't create a test case without actually accessing a sql
> database.
>
> On Fri, May 27, 2016 at 10:15 AM Ted Yu  wrote:
>
>> Which release of Spark are you using ?
>>
>> Is it possible to come up with fake data that shows what you
>> described ?
>>
>> Thanks
>>
>> On Fri, May 27, 2016 at 8:24 AM, Tim Gautier 
>> wrote:
>>
>>> Unfortunately I can't show exactly the data I'm using, but this is
>>> what I'm seeing:
>>>
>>> I have a case class 'Product' that represents a table in our
>>> database. I load that data via 
>>> sqlContext.read.format("jdbc").options(...).
>>> load.as[Product] and register it in a temp table 'product'.
>>>
>>> For testing, I created a new Dataset that has only 3 records in it:
>>>
>>> val ts = sqlContext.sql("select * from product where
>>> product_catalog_id in (1, 2, 3)").as[Product]
>>>
>>> I also created another one using the same case class and data, but
>>> from a sequence instead.
>>>
>>> val ds: Dataset[Product] = Seq(

Re: Pros and Cons

2016-05-27 Thread Mich Talebzadeh
Hi Ted,

do you mean Hive 2 with the Spark 2 snapshot build as the execution engine,
i.e. just the binaries for the snapshot (all OK)?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 27 May 2016 at 16:39, Ted Yu  wrote:

> Teng:
> Why not try out the 2.0 SANPSHOT build ?
>
> Thanks
>
> > On May 27, 2016, at 7:44 AM, Teng Qiu  wrote:
> >
> > ah, yes, the version is another mess!... no vendor's product
> >
> > i tried hadoop 2.6.2, hive 1.2.1 with spark 1.6.1, doesn't work.
> >
> > hadoop 2.6.2, hive 2.0.1 with spark 1.6.1, works, but need to fix this
> > from hive side https://issues.apache.org/jira/browse/HIVE-13301
> >
> > the jackson-databind lib from calcite-avatica.jar is too old.
> >
> > will try hadoop 2.7, hive 2.0.1 and spark 2.0.0, when spark 2.0.0
> released.
> >
> >
> > 2016-05-27 16:16 GMT+02:00 Mich Talebzadeh :
> >> Hi Teng,
> >>
> >>
> >> what version of spark are using as the execution engine. are you using a
> >> vendor's product here?
> >>
> >> thanks
> >>
> >> Dr Mich Talebzadeh
> >>
> >>
> >>
> >> LinkedIn
> >>
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>
> >>
> >>
> >> http://talebzadehmich.wordpress.com
> >>
> >>
> >>
> >>
> >>> On 27 May 2016 at 13:05, Teng Qiu  wrote:
> >>>
> >>> I agree with Koert and Reynold, spark works well with large dataset
> now.
> >>>
> >>> back to the original discussion, compare SparkSQL vs Hive in Spark vs
> >>> Spark API.
> >>>
> >>> SparkSQL vs Spark API you can simply imagine you are in RDBMS world,
> >>> SparkSQL is pure SQL, and Spark API is language for writing stored
> >>> procedure
> >>>
> >>> Hive on Spark is similar to SparkSQL, it is a pure SQL interface that
> >>> use spark as spark as execution engine, SparkSQL uses Hive's syntax,
> >>> so as a language, i would say they are almost the same.
> >>>
> >>> but Hive on Spark has a much better support for hive features,
> >>> especially hiveserver2 and security features, hive features in
> >>> SparkSQL is really buggy, there is a hiveserver2 impl in SparkSQL, but
> >>> in latest release version (1.6.x), hiveserver2 in SparkSQL doesn't
> >>> work with hivevar and hiveconf argument anymore, and the username for
> >>> login via jdbc doesn't work either...
> >>> see https://issues.apache.org/jira/browse/SPARK-13983
> >>>
> >>> i believe hive support in spark project is really very low priority
> >>> stuff...
> >>>
> >>> sadly Hive on spark integration is not that easy, there are a lot of
> >>> dependency conflicts... such as
> >>> https://issues.apache.org/jira/browse/HIVE-13301
> >>>
> >>> our requirement is using spark with hiveserver2 in a secure way (with
> >>> authentication and authorization), currently SparkSQL alone can not
> >>> provide this, we are using ranger/sentry + Hive on Spark.
> >>>
> >>> hope this can help you to get a better idea which direction you should
> go.
> >>>
> >>> Cheers,
> >>>
> >>> Teng
> >>>
> >>>
> >>> 2016-05-27 2:36 GMT+02:00 Koert Kuipers :
>  We do disk-to-disk iterative algorithms in spark all the time, on
>  datasets
>  that do not fit in memory, and it works well for us. I usually have to
>  do
>  some tuning of number of partitions for a new dataset but that's about
>  it in
>  terms of inconveniences.
> 
>  On May 26, 2016 2:07 AM, "Jörn Franke"  wrote:
> 
> 
>  Spark can handle this true, but it is optimized for the idea that it
>  works
>  it works on the same full dataset in-memory due to the underlying
> nature
>  of
>  machine learning algorithms (iterative). Of course, you can spill
> over,
>  but
>  that you should avoid.
> 
>  That being said you should have read my final sentence about this.
> Both
>  systems develop and change.
> 
> 
>  On 25 May 2016, at 22:14, Reynold Xin  wrote:
> 
> 
>  On Wed, May 25, 2016 at 9:52 AM, Jörn Franke 
>  wrote:
> >
> > Spark is more for machine learning working iteravely over the whole
> > same
> > dataset in memory. Additionally it has streaming and graph processing
> > capabilities that can be used together.
> 
> 
>  Hi Jörn,
> 
>  The first part is actually no true. Spark can handle data far greater
>  than
>  the aggregate memory available on a cluster. The more recent versions
>  (1.3+)
>  of Spark have external operations for almost all built-in operators,
> and
>  while things may not be perfect, those external operators are becoming
>  more
>  and more robust with each version of Spark.
> >
> > 

Fwd: Pros and Cons

2016-05-27 Thread Mich Talebzadeh
Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



-- Forwarded message --
From: Mich Talebzadeh 
Date: 27 May 2016 at 17:09
Subject: Re: Pros and Cons
To: Teng Qiu 
Cc: Ted Yu , Koert Kuipers , Jörn
Franke , user , Aakash Basu <
raj2coo...@gmail.com>, Reynold Xin 


not worth spending time really.

The only version that works is Spark 1.3.1 with Hive 2.
To be perfectly honest, deploying Spark as the Hive engine only requires certain
Spark capabilities, and I think 1.3.1 covers them. Remember we are
talking about the engine, not HQL etc.; that is all provided by Hive itself.

To me a performance gain of at least 2x compared to MR is perfectly
acceptable.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 27 May 2016 at 16:58, Teng Qiu  wrote:

> tried spark 2.0.0 preview, but no assembly jar there... then just gave
> up... :p
>
> 2016-05-27 17:39 GMT+02:00 Ted Yu :
> > Teng:
> > Why not try out the 2.0 SANPSHOT build ?
> >
> > Thanks
> >
> >> On May 27, 2016, at 7:44 AM, Teng Qiu  wrote:
> >>
> >> ah, yes, the version is another mess!... no vendor's product
> >>
> >> i tried hadoop 2.6.2, hive 1.2.1 with spark 1.6.1, doesn't work.
> >>
> >> hadoop 2.6.2, hive 2.0.1 with spark 1.6.1, works, but need to fix this
> >> from hive side https://issues.apache.org/jira/browse/HIVE-13301
> >>
> >> the jackson-databind lib from calcite-avatica.jar is too old.
> >>
> >> will try hadoop 2.7, hive 2.0.1 and spark 2.0.0, when spark 2.0.0
> released.
> >>
> >>
> >> 2016-05-27 16:16 GMT+02:00 Mich Talebzadeh :
> >>> Hi Teng,
> >>>
> >>>
> >>> what version of spark are using as the execution engine. are you using
> a
> >>> vendor's product here?
> >>>
> >>> thanks
> >>>
> >>> Dr Mich Talebzadeh
> >>>
> >>>
> >>>
> >>> LinkedIn
> >>>
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>
> >>>
> >>>
> >>> http://talebzadehmich.wordpress.com
> >>>
> >>>
> >>>
> >>>
>  On 27 May 2016 at 13:05, Teng Qiu  wrote:
> 
>  I agree with Koert and Reynold, spark works well with large dataset
> now.
> 
>  back to the original discussion, compare SparkSQL vs Hive in Spark vs
>  Spark API.
> 
>  SparkSQL vs Spark API you can simply imagine you are in RDBMS world,
>  SparkSQL is pure SQL, and Spark API is language for writing stored
>  procedure
> 
>  Hive on Spark is similar to SparkSQL, it is a pure SQL interface that
>  use spark as spark as execution engine, SparkSQL uses Hive's syntax,
>  so as a language, i would say they are almost the same.
> 
>  but Hive on Spark has a much better support for hive features,
>  especially hiveserver2 and security features, hive features in
>  SparkSQL is really buggy, there is a hiveserver2 impl in SparkSQL, but
>  in latest release version (1.6.x), hiveserver2 in SparkSQL doesn't
>  work with hivevar and hiveconf argument anymore, and the username for
>  login via jdbc doesn't work either...
>  see https://issues.apache.org/jira/browse/SPARK-13983
> 
>  i believe hive support in spark project is really very low priority
>  stuff...
> 
>  sadly Hive on spark integration is not that easy, there are a lot of
>  dependency conflicts... such as
>  https://issues.apache.org/jira/browse/HIVE-13301
> 
>  our requirement is using spark with hiveserver2 in a secure way (with
>  authentication and authorization), currently SparkSQL alone can not
>  provide this, we are using ranger/sentry + Hive on Spark.
> 
>  hope this can help you to get a better idea which direction you
> should go.
> 
>  Cheers,
> 
>  Teng
> 
> 
>  2016-05-27 2:36 GMT+02:00 Koert Kuipers :
> > We do disk-to-disk iterative algorithms in spark all the time, on
> > datasets
> > that do not fit in memory, and it works well for us. I usually have
> to
> > do
> > some tuning of number of partitions for a new dataset but that's
> about
> > it in
> > terms of inconveniences.
> >
> > On May 26, 2016 2:07 AM, "Jörn Franke"  wrote:
> >
> >
> > Spark can handle this true, but it is optimized for the idea that it
> > works
> > it works on the same full dataset in-memory due to the underlying
> nature
> > of
> > 

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Ted Yu
Which release of Spark are you using ?

Is it possible to come up with fake data that shows what you described ?

Thanks

On Fri, May 27, 2016 at 8:24 AM, Tim Gautier  wrote:

> Unfortunately I can't show exactly the data I'm using, but this is what
> I'm seeing:
>
> I have a case class 'Product' that represents a table in our database. I
> load that data via 
> sqlContext.read.format("jdbc").options(...).load.as[Product]
> and register it in a temp table 'product'.
>
> For testing, I created a new Dataset that has only 3 records in it:
>
> val ts = sqlContext.sql("select * from product where product_catalog_id in
> (1, 2, 3)").as[Product]
>
> I also created another one using the same case class and data, but from a
> sequence instead.
>
> val ds: Dataset[Product] = Seq(
>   Product(Some(1), ...),
>   Product(Some(2), ...),
>   Product(Some(3), ...)
> ).toDS
>
> The spark shell tells me these are exactly the same type at this point,
> but they don't behave the same.
>
> ts.as("ts1").joinWith(ts.as("ts2"), $"ts1.product_catalog_id" ===
> $"ts2.product_catalog_id")
> ds.as("ds1").joinWith(ds.as("ds2"), $"ds1.product_catalog_id" ===
> $"ds2.product_catalog_id")
>
> Again, spark tells me these self joins return exactly the same type, but
> when I do a .show on them, only the one created from a Seq works. The one
> created by reading from the database throws this error:
>
> org.apache.spark.sql.AnalysisException: cannot resolve
> 'ts1.product_catalog_id' given input columns: [..., product_catalog_id,
> ...];
>
> Is this a bug? Is there anyway to make the Dataset loaded from a table
> behave like the one created from a sequence?
>
> Thanks,
> Tim
>


Re: Pros and Cons

2016-05-27 Thread Ted Yu
Teng:
Why not try out the 2.0 SNAPSHOT build?

Thanks

> On May 27, 2016, at 7:44 AM, Teng Qiu  wrote:
> 
> ah, yes, the version is another mess!... no vendor's product
> 
> i tried hadoop 2.6.2, hive 1.2.1 with spark 1.6.1, doesn't work.
> 
> hadoop 2.6.2, hive 2.0.1 with spark 1.6.1, works, but need to fix this
> from hive side https://issues.apache.org/jira/browse/HIVE-13301
> 
> the jackson-databind lib from calcite-avatica.jar is too old.
> 
> will try hadoop 2.7, hive 2.0.1 and spark 2.0.0, when spark 2.0.0 released.
> 
> 
> 2016-05-27 16:16 GMT+02:00 Mich Talebzadeh :
>> Hi Teng,
>> 
>> 
>> what version of spark are using as the execution engine. are you using a
>> vendor's product here?
>> 
>> thanks
>> 
>> Dr Mich Talebzadeh
>> 
>> 
>> 
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> 
>> 
>> 
>> http://talebzadehmich.wordpress.com
>> 
>> 
>> 
>> 
>>> On 27 May 2016 at 13:05, Teng Qiu  wrote:
>>> 
>>> I agree with Koert and Reynold, spark works well with large dataset now.
>>> 
>>> back to the original discussion, compare SparkSQL vs Hive in Spark vs
>>> Spark API.
>>> 
>>> SparkSQL vs Spark API you can simply imagine you are in RDBMS world,
>>> SparkSQL is pure SQL, and Spark API is language for writing stored
>>> procedure
>>> 
>>> Hive on Spark is similar to SparkSQL, it is a pure SQL interface that
>>> use spark as spark as execution engine, SparkSQL uses Hive's syntax,
>>> so as a language, i would say they are almost the same.
>>> 
>>> but Hive on Spark has a much better support for hive features,
>>> especially hiveserver2 and security features, hive features in
>>> SparkSQL is really buggy, there is a hiveserver2 impl in SparkSQL, but
>>> in latest release version (1.6.x), hiveserver2 in SparkSQL doesn't
>>> work with hivevar and hiveconf argument anymore, and the username for
>>> login via jdbc doesn't work either...
>>> see https://issues.apache.org/jira/browse/SPARK-13983
>>> 
>>> i believe hive support in spark project is really very low priority
>>> stuff...
>>> 
>>> sadly Hive on spark integration is not that easy, there are a lot of
>>> dependency conflicts... such as
>>> https://issues.apache.org/jira/browse/HIVE-13301
>>> 
>>> our requirement is using spark with hiveserver2 in a secure way (with
>>> authentication and authorization), currently SparkSQL alone can not
>>> provide this, we are using ranger/sentry + Hive on Spark.
>>> 
>>> hope this can help you to get a better idea which direction you should go.
>>> 
>>> Cheers,
>>> 
>>> Teng
>>> 
>>> 
>>> 2016-05-27 2:36 GMT+02:00 Koert Kuipers :
 We do disk-to-disk iterative algorithms in spark all the time, on
 datasets
 that do not fit in memory, and it works well for us. I usually have to
 do
 some tuning of number of partitions for a new dataset but that's about
 it in
 terms of inconveniences.
 
 On May 26, 2016 2:07 AM, "Jörn Franke"  wrote:
 
 
 Spark can handle this true, but it is optimized for the idea that it
 works
 it works on the same full dataset in-memory due to the underlying nature
 of
 machine learning algorithms (iterative). Of course, you can spill over,
 but
 that you should avoid.
 
 That being said you should have read my final sentence about this. Both
 systems develop and change.
 
 
 On 25 May 2016, at 22:14, Reynold Xin  wrote:
 
 
 On Wed, May 25, 2016 at 9:52 AM, Jörn Franke 
 wrote:
> 
> Spark is more for machine learning working iteravely over the whole
> same
> dataset in memory. Additionally it has streaming and graph processing
> capabilities that can be used together.
 
 
 Hi Jörn,
 
 The first part is actually no true. Spark can handle data far greater
 than
 the aggregate memory available on a cluster. The more recent versions
 (1.3+)
 of Spark have external operations for almost all built-in operators, and
 while things may not be perfect, those external operators are becoming
 more
 and more robust with each version of Spark.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 




Re: Pros and Cons

2016-05-27 Thread Teng Qiu
tried spark 2.0.0 preview, but no assembly jar there... then just gave up... :p

2016-05-27 17:39 GMT+02:00 Ted Yu :
> Teng:
> Why not try out the 2.0 SANPSHOT build ?
>
> Thanks
>
>> On May 27, 2016, at 7:44 AM, Teng Qiu  wrote:
>>
>> ah, yes, the version is another mess!... no vendor's product
>>
>> i tried hadoop 2.6.2, hive 1.2.1 with spark 1.6.1, doesn't work.
>>
>> hadoop 2.6.2, hive 2.0.1 with spark 1.6.1, works, but need to fix this
>> from hive side https://issues.apache.org/jira/browse/HIVE-13301
>>
>> the jackson-databind lib from calcite-avatica.jar is too old.
>>
>> will try hadoop 2.7, hive 2.0.1 and spark 2.0.0, when spark 2.0.0 released.
>>
>>
>> 2016-05-27 16:16 GMT+02:00 Mich Talebzadeh :
>>> Hi Teng,
>>>
>>>
>>> what version of spark are using as the execution engine. are you using a
>>> vendor's product here?
>>>
>>> thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>>
 On 27 May 2016 at 13:05, Teng Qiu  wrote:

 I agree with Koert and Reynold, spark works well with large dataset now.

 back to the original discussion, compare SparkSQL vs Hive in Spark vs
 Spark API.

 SparkSQL vs Spark API you can simply imagine you are in RDBMS world,
 SparkSQL is pure SQL, and Spark API is language for writing stored
 procedure

 Hive on Spark is similar to SparkSQL, it is a pure SQL interface that
 use spark as spark as execution engine, SparkSQL uses Hive's syntax,
 so as a language, i would say they are almost the same.

 but Hive on Spark has a much better support for hive features,
 especially hiveserver2 and security features, hive features in
 SparkSQL is really buggy, there is a hiveserver2 impl in SparkSQL, but
 in latest release version (1.6.x), hiveserver2 in SparkSQL doesn't
 work with hivevar and hiveconf argument anymore, and the username for
 login via jdbc doesn't work either...
 see https://issues.apache.org/jira/browse/SPARK-13983

 i believe hive support in spark project is really very low priority
 stuff...

 sadly Hive on spark integration is not that easy, there are a lot of
 dependency conflicts... such as
 https://issues.apache.org/jira/browse/HIVE-13301

 our requirement is using spark with hiveserver2 in a secure way (with
 authentication and authorization), currently SparkSQL alone can not
 provide this, we are using ranger/sentry + Hive on Spark.

 hope this can help you to get a better idea which direction you should go.

 Cheers,

 Teng


 2016-05-27 2:36 GMT+02:00 Koert Kuipers :
> We do disk-to-disk iterative algorithms in spark all the time, on
> datasets
> that do not fit in memory, and it works well for us. I usually have to
> do
> some tuning of number of partitions for a new dataset but that's about
> it in
> terms of inconveniences.
>
> On May 26, 2016 2:07 AM, "Jörn Franke"  wrote:
>
>
> Spark can handle this true, but it is optimized for the idea that it
> works
> it works on the same full dataset in-memory due to the underlying nature
> of
> machine learning algorithms (iterative). Of course, you can spill over,
> but
> that you should avoid.
>
> That being said you should have read my final sentence about this. Both
> systems develop and change.
>
>
> On 25 May 2016, at 22:14, Reynold Xin  wrote:
>
>
> On Wed, May 25, 2016 at 9:52 AM, Jörn Franke 
> wrote:
>>
>> Spark is more for machine learning working iteravely over the whole
>> same
>> dataset in memory. Additionally it has streaming and graph processing
>> capabilities that can be used together.
>
>
> Hi Jörn,
>
> The first part is actually no true. Spark can handle data far greater
> than
> the aggregate memory available on a cluster. The more recent versions
> (1.3+)
> of Spark have external operations for almost all built-in operators, and
> while things may not be perfect, those external operators are becoming
> more
> and more robust with each version of Spark.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>




Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
I'm using 1.6.1.

I'm not sure what good fake data would do since it doesn't seem to have
anything to do with the data itself. It has to do with how the Dataset was
created. Both datasets have exactly the same data in them, but the one
created from a sql query fails where the one created from a Seq works. The
case class is just a few Option[Int] and Option[String] fields, nothing
special.

Obviously there's some sort of difference between the two datasets, but
Spark tells me they're exactly the same type with exactly the same data, so
I couldn't create a test case without actually accessing a sql database.

On Fri, May 27, 2016 at 10:15 AM Ted Yu  wrote:

> Which release of Spark are you using ?
>
> Is it possible to come up with fake data that shows what you described ?
>
> Thanks
>
> On Fri, May 27, 2016 at 8:24 AM, Tim Gautier 
> wrote:
>
>> Unfortunately I can't show exactly the data I'm using, but this is what
>> I'm seeing:
>>
>> I have a case class 'Product' that represents a table in our database. I
>> load that data via 
>> sqlContext.read.format("jdbc").options(...).load.as[Product]
>> and register it in a temp table 'product'.
>>
>> For testing, I created a new Dataset that has only 3 records in it:
>>
>> val ts = sqlContext.sql("select * from product where product_catalog_id
>> in (1, 2, 3)").as[Product]
>>
>> I also created another one using the same case class and data, but from a
>> sequence instead.
>>
>> val ds: Dataset[Product] = Seq(
>>   Product(Some(1), ...),
>>   Product(Some(2), ...),
>>   Product(Some(3), ...)
>> ).toDS
>>
>> The spark shell tells me these are exactly the same type at this point,
>> but they don't behave the same.
>>
>> ts.as("ts1").joinWith(ts.as("ts2"), $"ts1.product_catalog_id" ===
>> $"ts2.product_catalog_id")
>> ds.as("ds1").joinWith(ds.as("ds2"), $"ds1.product_catalog_id" ===
>> $"ds2.product_catalog_id")
>>
>> Again, spark tells me these self joins return exactly the same type, but
>> when I do a .show on them, only the one created from a Seq works. The one
>> created by reading from the database throws this error:
>>
>> org.apache.spark.sql.AnalysisException: cannot resolve
>> 'ts1.product_catalog_id' given input columns: [..., product_catalog_id,
>> ...];
>>
>> Is this a bug? Is there anyway to make the Dataset loaded from a table
>> behave like the one created from a sequence?
>>
>> Thanks,
>> Tim
>>
>
>


Re: Undocumented left join constraint?

2016-05-27 Thread Michael Armbrust
Sounds like: https://issues.apache.org/jira/browse/SPARK-15441, for which a
fix is in progress.

Please do keep reporting issues though, these are great!

Michael

On Fri, May 27, 2016 at 1:01 PM, Tim Gautier  wrote:

> Is it truly impossible to left join a Dataset[T] on the right if T has any
> non-option fields? It seems Spark tries to create Ts with null values in
> all fields when left joining, which results in null pointer exceptions. In
> fact, I haven't found any other way to get around this issue without making
> all fields in T options. Is there any other way?
>
> Example:
>
> case class Test(id: Int)
> val test1 = Seq(Test(1), Test(2), Test(3)).toDS
> val test2 = Seq(Test(2), Test(3), Test(4)).toDS
> test1.as("t1").joinWith(test2.as("t2"), $"t1.id" === $"t2.id",
> "left_outer").show
>
>


Re: Logistic Regression in Spark Streaming

2016-05-27 Thread Alonso Isidoro Roman
I do not have any experience using LR in spark, but you can see that LR is
already implemented in mllib.

http://spark.apache.org/docs/latest/mllib-linear-methods.html



Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman


2016-05-27 9:09 GMT+02:00 kundan kumar :

> Hi ,
>
> Do we have a streaming version of Logistic Regression in Spark ? I can see
> its there for the Linear Regression.
>
> Has anyone used logistic regression on streaming data, it would be really
> helpful if you share your insights on how to train the incoming data.
>
> In my use case I am trying to use logistic regression for click through
> rate prediction using spark. Reason to go for online streaming mode is we
> have new advertisers and items coming and old items leaving.
>
> Any insights would be helpful.
>
>
> Regards,
> Kundan
>
>
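For reference, spark.mllib does ship a streaming variant of logistic regression, StreamingLogisticRegressionWithSGD. A minimal sketch, assuming an existing SparkContext sc and placeholder HDFS directories where labeled click data arrives as LabeledPoint-formatted text:

import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// Each line is "(label,[f1,f2,...])", the text format LabeledPoint.parse understands.
val training = ssc.textFileStream("hdfs:///ctr/training").map(LabeledPoint.parse)
val test     = ssc.textFileStream("hdfs:///ctr/test").map(LabeledPoint.parse)

val numFeatures = 3   // placeholder feature count
val model = new StreamingLogisticRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(training)   // the model is updated on every incoming batch
model.predictOnValues(test.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()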


Logistic Regression in Spark Streaming

2016-05-27 Thread kundan kumar
Hi,

Do we have a streaming version of Logistic Regression in Spark? I can see
it's there for Linear Regression.

Has anyone used logistic regression on streaming data? It would be really
helpful if you could share your insights on how to train on the incoming data.

In my use case I am trying to use logistic regression for click-through
rate prediction using Spark. The reason to go for online streaming mode is that
we have new advertisers and items coming in and old items leaving.

Any insights would be helpful.


Regards,
Kundan


submitMissingTasks - serialize throws StackOverflow exception

2016-05-27 Thread Michel Hubert
Hi,

My Spark application throws StackOverflowError exceptions after a while.
The DAGScheduler function submitMissingTasks tries to serialize a tuple
(MapPartitionsRDD, EsSpark.saveToEs), which is handled with a recursive
algorithm.
The recursion is too deep and results in a stack overflow exception.

Should I just try to increase the heap size? Or will it then just happen later
in time?

How can I fix this?

With kind regards,
michel
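The usual suspect for a StackOverflowError during task serialization is a very long RDD lineage rather than heap pressure, so the common mitigations are a larger thread stack and periodic checkpointing to truncate the lineage. A hedged sketch, with placeholder sizes and paths:

// 1. Give the driver and executor JVMs a deeper thread stack (heap size -Xmx does
//    not help with stack overflows). Passed at submit time, for example:
//      --conf spark.driver.extraJavaOptions=-Xss8m
//      --conf spark.executor.extraJavaOptions=-Xss8m

// 2. Cut the lineage before it gets deep enough to overflow the serializer.
sc.setCheckpointDir("hdfs:///checkpoints/es-writer")

var rdd = sc.textFile("hdfs:///input/events")   // placeholder source
for (i <- 1 to 200) {
  rdd = rdd.map(identity)          // stand-in for whatever keeps extending the lineage
  if (i % 50 == 0) {
    rdd.checkpoint()               // truncate the chain every N iterations
    rdd.count()                    // force an action so the checkpoint is materialized
  }
}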


