RMSE recommender system

2017-05-20 Thread Arun

Hi all,

I am new to machine learning.

I am working on a recommender system. The RMSE on the training dataset is 0.08, while on the test data it is 2.345.

What conclusion should I draw from this, and what steps can I take to improve?
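
A training RMSE of 0.08 against a test RMSE of 2.345 usually points to overfitting. Below is a minimal sketch of one common remedy, assuming Spark ML's ALS recommender and a `ratings` DataFrame with user, item and rating columns (the column names, rank, and regParam values are illustrative only): increase regularization and choose regParam by comparing RMSE on held-out data.

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS

// `ratings` is an assumed DataFrame with columns (user, item, rating).
val Array(train, test) = ratings.randomSplit(Array(0.8, 0.2), seed = 42)

val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")

// Larger regParam values usually shrink the gap between training and test
// RMSE, at the cost of slightly higher training error.
for (reg <- Seq(0.01, 0.1, 1.0)) {
  val model = new ALS()
    .setUserCol("user")
    .setItemCol("item")
    .setRatingCol("rating")
    .setRank(10)
    .setRegParam(reg)
    .fit(train)
  // Drop NaN predictions for users/items unseen in training.
  val predictions = model.transform(test).na.drop()
  println(s"regParam=$reg  test RMSE=${evaluator.evaluate(predictions)}")
}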



Sent from Samsung tablet


Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-20 Thread Kabeer Ahmed
Thank you Takeshi. As far as I can see from the code you pointed to, the default number of bytes to pack into a partition is set to 128 MB, the Parquet block size.

Daniel, it seems you do have a need to modify the number of bytes packed per partition. I am curious to know the scenario. Please share if you can.

Thanks,
Kabeer.




  




couple of naive questions on Spark Structured Streaming

2017-05-20 Thread kant kodali
Hi,

1. Can we use Spark Structured Streaming for stateless transformations just
like we would with DStreams, or is Spark Structured Streaming only meant
for stateful computations?

2. When we use groupBy and window operations for event-time processing and
specify a watermark, does this mean the timestamp field in each message is
compared to the processing time of that machine/node, and events later than
the specified threshold are discarded? If we don't specify a watermark, I am
assuming processing time won't come into the picture. Is that right? I am just
trying to understand the interplay between processing time and event time
when we do event-time processing.
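
For reference, here is a minimal sketch of a windowed count with a watermark (the built-in "rate" source is used only because it yields a streaming DataFrame with a `timestamp` column; the durations and output sink are illustrative). Note that the watermark is derived from the maximum event time the query has observed so far, not from the node's processing-time clock.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object WatermarkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WatermarkSketch").getOrCreate()
    import spark.implicits._

    // The "rate" source emits rows with (timestamp, value) columns.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 10)
      .load()

    val counts = events
      // Rows whose event time is more than 10 minutes older than the maximum
      // event time seen so far may be dropped; windows older than that
      // watermark can be finalized and their state released.
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"))
      .count()

    counts.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}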

Thanks!


unsubscribe

2017-05-20 Thread williamtellme123
unsubscribe

From: Abir Chakraborty [mailto:abi...@247-inc.com] 
Sent: Saturday, May 20, 2017 1:29 AM
To: user@spark.apache.org
Subject: unsubscribe

 

 



Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-20 Thread Takeshi Yamamuro
I think this refers to the logic here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L418

This logic merges small files into a partition, and you can control the
threshold via `spark.sql.files.maxPartitionBytes`.
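
For example, a minimal sketch of changing that threshold (the 64 MB value and the path are illustrative; the default is 128 MB):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CoalesceSmallFiles")
  // Pack roughly 64 MB of input files per partition instead of the 128 MB default.
  .config("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)
  .getOrCreate()

// Many small files under this (illustrative) path are coalesced into
// partitions of up to ~spark.sql.files.maxPartitionBytes each.
val df = spark.read.parquet("s3a://bucket/path/to/small-files/")
println(df.rdd.getNumPartitions)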

// maropu


On Sat, May 20, 2017 at 8:15 AM, ayan guha  wrote:

> I think, like all other read operations, it is driven by the input format used,
> and I think some variation of combine file input format is used by default.
> I think you can test it by forcing a particular input format which gets one
> file per split; then you should end up with the same number of partitions as
> your data files.
>
> On Sat, 20 May 2017 at 5:12 am, Aakash Basu 
> wrote:
>
>> Hey all,
>>
>> A reply on this would be great!
>>
>> Thanks,
>> A.B.
>>
>> On 17-May-2017 1:43 AM, "Daniel Siegmann" 
>> wrote:
>>
>>> When using spark.read on a large number of small files, these are
>>> automatically coalesced into fewer partitions. The only documentation I can
>>> find on this is in the Spark 2.0.0 release notes, where it simply says (
>>> http://spark.apache.org/releases/spark-release-2-0-0.html):
>>>
>>> "Automatic file coalescing for native data sources"
>>>
>>> Can anyone point me to documentation explaining what triggers this
>>> feature, how it decides how many partitions to coalesce to, and what counts
>>> as a "native data source"? I couldn't find any mention of this feature in
>>> the SQL Programming Guide and Google was not helpful.
>>>
>>> --
>>> Daniel Siegmann
>>> Senior Software Engineer
>>> *SecurityScorecard Inc.*
>>> 214 W 29th Street, 5th Floor
>>> New York, NY 10001
>>>
>>> --
> Best Regards,
> Ayan Guha
>



-- 
---
Takeshi Yamamuro


Re: SparkSQL not able to read an empty table location

2017-05-20 Thread Steve Loughran

On 20 May 2017, at 01:44, Bajpai, Amit X. -ND wrote:

Hi,

I have a Hive external table whose S3 location contains no files (but the S3
location directory does exist). When I try to use Spark SQL to count the
number of records in the table, it throws an error saying “File s3n://data/xyz
does not exist. null/0”.

select * from tablex limit 10

Can someone let me know how we can fix this issue.

Thanks


There isn't really a "directory" in S3, just a set of objects whose paths begin
with a string. Try creating an empty file with an _ prefix in the directory; it
should be ignored by Spark SQL but will cause the "directory" to come into being.
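
A minimal sketch of that workaround from spark-shell, reusing the s3n://data/xyz location from the question (the marker file name is illustrative):

import org.apache.hadoop.fs.{FileSystem, Path}

// `spark` is the SparkSession provided by spark-shell.
val tableDir = new Path("s3n://data/xyz/")
val fs = FileSystem.get(tableDir.toUri, spark.sparkContext.hadoopConfiguration)

// Zero-byte marker object: files whose names start with "_" are skipped by
// Spark SQL when listing, but the prefix (the "directory") now exists.
fs.create(new Path(tableDir, "_placeholder")).close()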


unsubscribe

2017-05-20 Thread Abir Chakraborty